Hi,
I tried to upgrade Any23 from 2.1 to 2.2 in the Nutch code base.
I changed the version in two files:
1. src/plugin/any23/ivy.xml
2. src/plugin/any23/plugin.xml
After running "ant runtime", the following jar files are present in
runtime/local/plugins/any23:
any23.jar
apache-any23-api-2.2.jar
apache-any23-core-2.2.jar
apache-any23-csvutils-2.2.jar
apache-any23-encoding-2.2.jar
apache-any23-mime-2.2.jar
I then ran a simple parsechecker test on a test HTML file and got these errors:
1. java.util.concurrent.ExecutionException:
java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
Caused by: java.lang.NoClassDefFoundError:
org/eclipse/rdf4j/common/lang/service/ServiceRegistry
2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
...
Caused by: java.lang.NoClassDefFoundError:
org/apache/any23/extractor/ExtractorRegistryImpl
The entire log is attached as debug.txt.
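One way to narrow this down is to check whether any jar in the plugin directory actually contains the two classes named in the errors. Below is a minimal sketch of such a check, not part of Nutch or Any23: the class name FindClassInJars and the helper jarsContaining are made up for illustration, and the default directory is simply the one listed above. It only tells whether a .class entry is present in a jar. It cannot tell whether the plugin class loader will pick that jar up, or whether a class that is present fails during static initialization.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper (not part of Nutch or Any23): lists which jars in a
// directory contain a given class file.
public class FindClassInJars {

    /** Returns the jars directly under dir that contain an entry for className. */
    static List<Path> jarsContaining(Path dir, String className) throws IOException {
        // Accept both org.foo.Bar and org/foo/Bar notation.
        String entryName = className.replace('.', '/') + ".class";
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getJarEntry(entryName) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Assumption: default to the plugin directory mentioned in this mail.
        Path dir = Paths.get(args.length > 0 ? args[0] : "runtime/local/plugins/any23");
        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        // The two classes reported in the NoClassDefFoundError messages.
        String[] missing = {
            "org/eclipse/rdf4j/common/lang/service/ServiceRegistry",
            "org/apache/any23/extractor/ExtractorRegistryImpl"
        };
        for (String cls : missing) {
            List<Path> hits = jarsContaining(dir, cls);
            System.out.println(cls + " -> " + (hits.isEmpty() ? "not found in any jar" : hits));
        }
    }
}
```

If a class is reported as not found, the jar that provides it was probably not copied into the plugin directory. If it is found but the error persists, the next thing to check would be whether that jar is also declared in plugin.xml.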
Regards,
Govind
2018-04-02 17:09:49,999 INFO parse.ParserChecker (ParserChecker.java:run(122))
- fetching: file:/tmp/exact_code.html
2018-04-02 17:09:50,205 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No
object cache found for conf=Configuration: core-default.xml, core-site.xml,
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,328 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No
object cache found for conf=Configuration: core-default.xml, core-site.xml,
nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,366 TRACE file.File (FileResponse.java:<init>(117)) -
fetching file:/tmp/exact_code.html
2018-04-02 17:09:50,450 INFO parse.ParseSegment
(ParseSegment.java:isTruncated(207)) - file:/tmp/exact_code.html skipped.
Content of size 79433 was truncated to 65536
2018-04-02 17:09:50,450 WARN parse.ParserChecker (ParserChecker.java:run(187))
- Content is truncated, parse may fail!
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: extractor,
extension-id: ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-tika,
extension-id: org.apache.nutch.parse.tika.TikaParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-ext,
extension-id: ExtParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-html,
extension-id: org.apache.nutch.parse.html.HtmlParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-js,
extension-id: JSParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: feed,
extension-id: org.apache.nutch.parse.feed.FeedParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-swf,
extension-id: org.apache.nutch.parse.swf.SWFParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader
(ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-zip,
extension-id: org.apache.nutch.parse.zip.ZipParser
2018-04-02 17:09:50,461 INFO parse.ParserFactory
(ParserFactory.java:matchExtensions(374)) - The parsing plugins:
[org.apache.nutch.parse.tika.TikaParser -
org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes
system property, and all claim to support the content type text/html, but they
are not mapped to it in the parse-plugins.xml file
2018-04-02 17:09:50,871 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) -
Parsing [file:/tmp/exact_code.html] with
[org.apache.nutch.parse.tika.TikaParser@693fe6c9]
2018-04-02 17:09:50,878 DEBUG tika.TikaParser (TikaParser.java:getParse(101)) -
Using Tika parser org.apache.tika.parser.html.HtmlParser for mime-type text/html
2018-04-02 17:09:51,205 TRACE tika.TikaParser (TikaParser.java:getParse(152)) -
Meta tags for file:/tmp/exact_code.html: base=null, noCache=false,
noFollow=false, noIndex=false, refresh=false, refreshHref=null
* general tags:
- viewport = width=device-width, initial-scale=1
- dc:title = I.F. on Kharms – Just a Beginning
- content-encoding = UTF-8
- generator = WordPress 4.9.4
- content-type = text/html; charset=UTF-8
- robots = index,follow
* http-equiv tags:
2018-04-02 17:09:51,206 TRACE tika.TikaParser (TikaParser.java:getParse(159)) -
Getting text...
2018-04-02 17:09:51,222 TRACE tika.TikaParser (TikaParser.java:getParse(165)) -
Getting title...
2018-04-02 17:09:51,224 TRACE tika.TikaParser (TikaParser.java:getParse(183)) -
Getting links (base URL = file:/tmp/exact_code.html) ...
2018-04-02 17:09:51,227 TRACE