[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765308#comment-17765308 ]

ASF GitHub Bot commented on NUTCH-2959:

tballison opened a new pull request, #776:
URL: https://github.com/apache/nutch/pull/776
> Upgrade to Apache Tika 2.4.1
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
> Issue Type: Task
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Priority: Major
> Fix For: 1.20
> Attachments: NUTCH-2959.patch

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765306#comment-17765306 ]

Tim Allison commented on NUTCH-2959:

Currently working on this to bump to Tika 2.9.0. PR incoming once I get a clean build.
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725842#comment-17725842 ]

Tim Allison commented on NUTCH-2959:

tika-server would be cleaner? Could have autoscaling pods of tika-servers?
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725839#comment-17725839 ]

Sebastian Nagel commented on NUTCH-2959:

Hi [~tallison],

if running in local mode it might be a good option to delegate the parsing to a separate process. When running on a Hadoop cluster, it might cause some headaches to get the process running on the task nodes.
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725807#comment-17725807 ]

Tim Allison commented on NUTCH-2959:

Separately, I'm wondering if it would be useful to add an alternative Tika parser that relies on tika-server or a modified version of a pipes-parser. This would put all of the Tika dependencies and jar hell in its own process, and we wouldn't have to load any dependencies aside from tika-core into Nutch's JVM. They're working on doing this over on Solr now as well (I think they've chosen the tika-server route).
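To make the tika-server idea concrete, here is a minimal sketch of what the client side could look like, using only the JDK's `java.net.http` client so that no Tika parser jars are needed in the calling JVM. It assumes a tika-server instance is already running on its default port 9998 and uses the server's `PUT /tika` endpoint with `Accept: text/plain`. The class `TikaServerSketch` and its method names are hypothetical; nothing like this exists in Nutch, and a real parser plugin would also have to map the response onto Nutch's parse result types.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

/** Hypothetical helper, not part of Nutch: sends raw content to a running tika-server. */
final class TikaServerSketch {

  /** Assumed address of a locally started tika-server (default port 9998). */
  static final URI DEFAULT_SERVER = URI.create("http://localhost:9998");

  /** Builds a PUT request against the /tika endpoint, asking for plain text back. */
  static HttpRequest buildRequest(URI server, byte[] content) {
    return HttpRequest.newBuilder(server.resolve("/tika"))
        .timeout(Duration.ofSeconds(60)) // arbitrary timeout chosen for the sketch
        .header("Accept", "text/plain")
        .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
        .build();
  }

  /** Sends the content and returns the extracted text; throws on a non-200 status. */
  static String extractText(HttpClient client, URI server, byte[] content)
      throws IOException, InterruptedException {
    HttpResponse<String> response =
        client.send(buildRequest(server, content), HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
      throw new IOException("tika-server returned HTTP " + response.statusCode());
    }
    return response.body();
  }

  public static void main(String[] args) throws Exception {
    if (args.length != 1) {
      System.err.println("usage: TikaServerSketch <file>");
      return;
    }
    byte[] content = Files.readAllBytes(Path.of(args[0]));
    System.out.println(extractText(HttpClient.newHttpClient(), DEFAULT_SERVER, content));
  }
}
```

Keeping the request construction separate from the network call makes it possible to unit-test the client without a server, and the server address could point at a load balancer in front of several tika-server instances.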
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725805#comment-17725805 ]

Tim Allison commented on NUTCH-2959:

I just opened a PR to upgrade Tika to 2.8.0 on ANY23: https://issues.apache.org/jira/browse/ANY23-610 -> [https://github.com/apache/any23/pull/320]

Let's see if we can get buy-in and maybe another release of any23?
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582534#comment-17582534 ]

Sebastian Nagel commented on NUTCH-2959:

Hi [~markus17],

moving this to 1.20: I can reproduce the issue with the failing unit test TestRobotsMetaProcessor, likely caused by an incompatibility between Tika 2.3.0 (used by any23) and 2.4.1 (used here by parse-tika and in core).
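The suspected cause above is two Tika versions ending up in the same runtime. One way to check for that is to group the Tika jars found in a build directory by version, as in the rough sketch below. It assumes jar files follow the usual `artifact-version.jar` naming with purely numeric versions; the class `TikaVersionCheck` and the sample jar names in `main` are hypothetical illustrations, not taken from the Nutch build.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Nutch: groups Tika artifacts by version. */
final class TikaVersionCheck {

  // Assumed naming: tika-<artifact>-<numeric version>.jar, e.g. tika-core-2.4.1.jar
  private static final Pattern TIKA_JAR =
      Pattern.compile("(tika-[a-z0-9-]+?)-(\\d+(?:\\.\\d+)*)\\.jar");

  /** Maps each Tika version found in the jar names to the artifacts using it. */
  static Map<String, Set<String>> artifactsByVersion(List<String> jarNames) {
    Map<String, Set<String>> byVersion = new TreeMap<>();
    for (String name : jarNames) {
      Matcher m = TIKA_JAR.matcher(name);
      if (m.matches()) {
        byVersion.computeIfAbsent(m.group(2), v -> new TreeSet<>()).add(m.group(1));
      }
    }
    return byVersion;
  }

  public static void main(String[] args) {
    // Illustrative jar names only; real names depend on the resolved dependencies.
    List<String> jars = List.of("tika-core-2.4.1.jar", "tika-parsers-2.3.0.jar");
    Map<String, Set<String>> byVersion = artifactsByVersion(jars);
    if (byVersion.size() > 1) {
      System.out.println("Mixed Tika versions: " + byVersion);
    }
  }
}
```

More than one key in the returned map means mixed Tika versions, which is the situation described for any23 and parse-tika.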
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577845#comment-17577845 ]

Sebastian Nagel commented on NUTCH-2959:

Hi [~markus17],

regarding the error with the javax.ws.rs dependency: this was an issue with a long story until it was fixed (cf. NUTCH-2669, NUTCH-2697, IVY-1586). I remember it was painful to get a clean system: delete ~/.ivy2/ and make sure that no ivy jar older than 2.5.0 is used and writes to ~/.ivy2/. This prohibits building older versions of Nutch, and also other projects built with ant/ivy. An older version of ivy could also be requested and downloaded by a Nutch plugin - check for the properties ivy.version or ivy.installversion, and also whether ivy jars happen to be installed somewhere on the system (e.g. ~/.ivy2/lib/).

While trying to upgrade to 2.4.0 (NUTCH-2948) I also ran into a test failure, probably due to conflicting dependencies:
- tika-core 2.4.0 required by Nutch core (ivy/ivy.xml)
- any23 requiring tika-parser 2.3.0
- parse-tika requiring tika-parser 2.4.0

In the past there were no issues as any23 includes tika-core. But, eventually, we now need to exclude or overwrite some deps in any23.
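The check for stray Ivy jars older than 2.5.0 can be automated. The sketch below walks `~/.ivy2` and prints any file whose name looks like an Ivy jar with a lower version. It assumes the jars are named `ivy-<numeric version>.jar`, so pre-release names such as release candidates would not be matched; the class `OldIvyJarScan` is hypothetical and other locations on the system would need to be scanned separately.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

/** Hypothetical helper, not part of Nutch: finds Ivy jars below a minimum version. */
final class OldIvyJarScan {

  // Assumed naming: ivy-<numeric version>.jar, e.g. ivy-2.4.0.jar
  private static final Pattern IVY_JAR = Pattern.compile("ivy-(\\d+(?:\\.\\d+)*)\\.jar");

  /** Compares dotted numeric versions; missing components count as 0. */
  static int compareVersions(String a, String b) {
    String[] x = a.split("\\.");
    String[] y = b.split("\\.");
    for (int i = 0; i < Math.max(x.length, y.length); i++) {
      int xi = i < x.length ? Integer.parseInt(x[i]) : 0;
      int yi = i < y.length ? Integer.parseInt(y[i]) : 0;
      if (xi != yi) {
        return Integer.compare(xi, yi);
      }
    }
    return 0;
  }

  /** True if the file name looks like an Ivy jar with a version below minVersion. */
  static boolean isOldIvyJar(String fileName, String minVersion) {
    Matcher m = IVY_JAR.matcher(fileName);
    return m.matches() && compareVersions(m.group(1), minVersion) < 0;
  }

  /** Recursively collects old Ivy jars under root; returns an empty list if root is missing. */
  static List<Path> findOldIvyJars(Path root, String minVersion) throws IOException {
    List<Path> hits = new ArrayList<>();
    if (!Files.isDirectory(root)) {
      return hits;
    }
    try (Stream<Path> paths = Files.walk(root)) {
      paths.filter(Files::isRegularFile)
          .filter(p -> isOldIvyJar(p.getFileName().toString(), minVersion))
          .forEach(hits::add);
    }
    return hits;
  }

  public static void main(String[] args) throws IOException {
    Path root = Path.of(System.getProperty("user.home"), ".ivy2");
    for (Path p : findOldIvyJars(root, "2.5.0")) {
      System.out.println(p);
    }
  }
}
```

Versions are compared numerically component by component rather than as strings, so that for example 2.10 sorts after 2.9.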
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577447#comment-17577447 ]

Markus Jelsma commented on NUTCH-2959:

Nice, thanks to NUTCH-2669 I can get past the issue by using:

ant -Dpackaging.type=jar clean runtime test

Everything now builds, except that I am stopped by the indexer-elastic plugin; it is the same error that I had some time ago as well.

{code:java}
[javac] /home/markus/projects/apache/nutch/nutch/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: error: package org.apache.http.impl.nio.client does not exist
[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
{code}

I disabled the plugin; the tests seem to pass except for TestRobotsMetaProcessor. It complains about Any23.
[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577420#comment-17577420 ]

Markus Jelsma commented on NUTCH-2959:

Here's a patch. This patch does not include the change in plugin.xml for any23. It is also untested because for some reason I cannot build Nutch anymore, again :(

{code:java}
[ivy:resolve] [FAILED ] javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}: (0ms)
[ivy:resolve] local: tried
[ivy:resolve] /home/markus/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
[ivy:resolve] maven2: tried
[ivy:resolve] https://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve] apache-snapshot: tried
[ivy:resolve] https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve] sonatype: tried
[ivy:resolve] https://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve] ::
[ivy:resolve] :: FAILED DOWNLOADS ::
[ivy:resolve] :: ^ see resolution messages for details ^ ::
[ivy:resolve] ::
[ivy:resolve] :: javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}
[ivy:resolve] ::
{code}

I cleared my Ivy cache and created a clean checkout. Some other build error mysteriously solved itself, and now we see this one. I haven't seen this error in a long time.