Re: any23 2.2 upgrading in NUTCH gives errors

2018-04-02 Thread lewis john mcgibbney
Hi Govind,

Please scope out https://github.com/apache/nutch/pull/306
Let me know how things go.
Lewis

On Mon, Apr 2, 2018 at 4:45 AM, <user-digest-h...@nutch.apache.org> wrote:

>
>
> From: govind nitk <govind.n...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Mon, 2 Apr 2018 17:15:38 +0530
> Subject: any23 2.2 upgrading in NUTCH gives errors
> Hi,
>
> I tried to upgrade any23 from 2.1 to 2.2 in the Nutch code base.
>
> Changes:
> 1. src/plugin/any23/ivy.xml:
>  conf="*->default">
>
> 2. src/plugin/any23/plugin.xml
>
> 
> 
> 
> 
> 
>
>
> After running "ant runtime", the following jar files are present in
> runtime/local/plugins/any23:
>
> any23.jar
> apache-any23-api-2.2.jar
> apache-any23-core-2.2.jar
> apache-any23-csvutils-2.2.jar
> apache-any23-encoding-2.2.jar
> apache-any23-mime-2.2.jar
>
>
>
>
> I ran a simple parse checker on a test HTML file and got these errors:
> 1. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry
>
> Caused by: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry
>
> 2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl
> ...
> Caused by: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl
>
>
>
>
>
>


any23 2.2 upgrading in NUTCH gives errors

2018-04-02 Thread govind nitk
Hi,

I tried to upgrade any23 from 2.1 to 2.2 in the Nutch code base.

Changes:
1. src/plugin/any23/ivy.xml:


2. src/plugin/any23/plugin.xml








After running "ant runtime", the following jar files are present in
runtime/local/plugins/any23:

any23.jar
apache-any23-api-2.2.jar
apache-any23-core-2.2.jar
apache-any23-csvutils-2.2.jar
apache-any23-encoding-2.2.jar
apache-any23-mime-2.2.jar




I ran a simple parse checker on a test HTML file and got these errors:
1. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry

Caused by: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/service/ServiceRegistry

2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl
...
Caused by: java.lang.NoClassDefFoundError: org/apache/any23/extractor/ExtractorRegistryImpl
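
One way to narrow down NoClassDefFoundError reports like the two above is to check which jars in the plugin directory actually contain the missing class. The following is a minimal Java sketch, not part of Nutch or Any23; the class name JarClassFinder and its methods are hypothetical, made up for illustration. It assumes the plugin's dependencies sit as *.jar files directly under one directory, as in the listing of runtime/local/plugins/any23.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

// Hypothetical helper (not a Nutch or Any23 class): lists the jars in a
// directory that contain a given class.
public class JarClassFinder {

    // Convert a dotted or slash-separated class name to its jar entry path.
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    // Return every *.jar directly under dir that has an entry for className.
    static List<Path> findJarsContaining(Path dir, String className) throws IOException {
        String entry = toEntryName(className);
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jarFile = new JarFile(jar.toFile())) {
                    if (jarFile.getEntry(entry) != null) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Defaults mirror the directory and first missing class from the post.
        Path dir = Paths.get(args.length > 0 ? args[0] : "runtime/local/plugins/any23");
        String className = args.length > 1
                ? args[1]
                : "org/eclipse/rdf4j/common/lang/service/ServiceRegistry";

        if (!Files.isDirectory(dir)) {
            System.out.println("Not a directory: " + dir);
            return;
        }
        List<Path> hits = findJarsContaining(dir, className);
        if (hits.isEmpty()) {
            System.out.println("No jar in " + dir + " contains " + toEntryName(className));
        } else {
            for (Path hit : hits) {
                System.out.println(hit);
            }
        }
    }
}
```

If no jar is reported for a class, the dependency that provides it was probably not copied into the plugin directory. Even when a jar is reported, it may still need to be listed in the plugin's plugin.xml. Both are possibilities to verify, not a confirmed diagnosis of this report.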




The entire log file is attached as debug.txt.


Regards,
Govind
2018-04-02 17:09:49,999 INFO  parse.ParserChecker (ParserChecker.java:run(122)) - fetching: file:/tmp/exact_code.html
2018-04-02 17:09:50,205 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,328 DEBUG util.ObjectCache (ObjectCache.java:get(43)) - No object cache found for conf=Configuration: core-default.xml, core-site.xml, nutch-default.xml, nutch-site.xml, instantiating a new object cache
2018-04-02 17:09:50,366 TRACE file.File (FileResponse.java:(117)) - fetching file:/tmp/exact_code.html
2018-04-02 17:09:50,450 INFO  parse.ParseSegment (ParseSegment.java:isTruncated(207)) - file:/tmp/exact_code.html skipped. Content of size 79433 was truncated to 65536
2018-04-02 17:09:50,450 WARN  parse.ParserChecker (ParserChecker.java:run(187)) - Content is truncated, parse may fail!
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: extractor, extension-id: ir.co.bayan.simorq.zal.extractor.nutch.ExtractorParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-tika, extension-id: org.apache.nutch.parse.tika.TikaParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-ext, extension-id: ExtParser
2018-04-02 17:09:50,457 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-html, extension-id: org.apache.nutch.parse.html.HtmlParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-js, extension-id: JSParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: feed, extension-id: org.apache.nutch.parse.feed.FeedParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-swf, extension-id: org.apache.nutch.parse.swf.SWFParser
2018-04-02 17:09:50,458 TRACE parse.ParsePluginsReader (ParsePluginsReader.java:getAliases(264)) - Found alias: plugin-id: parse-zip, extension-id: org.apache.nutch.parse.zip.ZipParser
2018-04-02 17:09:50,461 INFO  parse.ParserFactory (ParserFactory.java:matchExtensions(374)) - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser - org.apache.nutch.parse.html.HtmlParser] are enabled via the plugin.includes system property, and all claim to support the content type text/html, but they are not mapped to it  in the parse-plugins.xml file
2018-04-02 17:09:50,871 DEBUG parse.ParseUtil (ParseUtil.java:parse(91)) - Parsing [file:/tmp/exact_code.html] with [org.apache.nutch.parse.tika.TikaParser@693fe6c9]
2018-04-02 17:09:50,878 DEBUG tika.TikaParser (TikaParser.java:getParse(101)) - Using Tika parser org.apache.tika.parser.html.HtmlParser for mime-type text/html
2018-04-02 17:09:51,205 TRACE tika.TikaParser (TikaParser.java:getParse(152)) - Meta tags for file:/tmp/exact_code.html: base=null, noCache=false, noFollow=false, noIndex=false, refresh=false, refreshHref=null
 * general tags:
   - viewport   =   width=device-width, initial-scale=1
   - dc:title   =   I.F. on Kharms – Just a Beginning
   - content-encoding   =   UTF-8
   - generator  =   WordPress 4.9.4
   - content-type   =   text/html; charset=UTF-8
   - robots =   index,follow
 * http-equiv tags:

2018-04-02 17:09:51,206 TRACE tika.TikaParser (TikaParser.java:getParse(159)) - Getting text...
2018-04-02 17:09:51,222 TRACE tika.TikaParser (TikaParser.java:getParse(165)) - Getting title...
2018-04-02 17:09:51,224 TRACE tika.TikaParser (TikaParser.java:getParse(183)) - Getting links (base URL = file:/tmp/exact_code.html) ...
2018-04-02 17:09:51,227 TRACE