[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-09-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765308#comment-17765308
 ] 

ASF GitHub Bot commented on NUTCH-2959:
---

tballison opened a new pull request, #776:
URL: https://github.com/apache/nutch/pull/776

   Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! 
Your help is appreciated!
   
   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue 
tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the 
problem or the improvement. We cannot accept pull requests without an issue 
because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square 
brackets (`[NUTCH-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting 
rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean 
runtime test`
   * there should be no conflicts when merging the pull request branch into the 
*recent* master branch. If there are conflicts, please try to rebase the pull 
request branch on top of a freshly pulled master branch.
   * if new dependencies are added,
     - are these dependencies licensed in a way that is compatible for 
inclusion under [ASF 
2.0](https://www.apache.org/legal/resolved.html#category-a)?
     - are `LICENSE-binary` and `NOTICE-binary` updated accordingly?
   
   We will be able to integrate your pull request faster if these conditions 
are met. If you have any questions about how to fix your problem or about using 
Nutch in general, please sign up for the [Nutch mailing 
list](https://nutch.apache.org/mailing_lists.html). Thanks!
   




> Upgrade to Apache Tika 2.4.1
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.19
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.20
>
> Attachments: NUTCH-2959.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-09-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17765306#comment-17765306
 ] 

Tim Allison commented on NUTCH-2959:


Currently working on this to bump to Tika 2.9.0.  PR incoming once I get a 
clean build.



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725842#comment-17725842
 ] 

Tim Allison commented on NUTCH-2959:


tika-server would be cleaner?  Could have autoscaling pods of tika-servers?



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-05-24 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725839#comment-17725839
 ] 

Sebastian Nagel commented on NUTCH-2959:


Hi [~tallison], if running in local mode it might be a good option to delegate 
the parsing to a separate process. When running on a Hadoop cluster, it might 
cause some headaches to get the process running on the task nodes.



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725807#comment-17725807
 ] 

Tim Allison commented on NUTCH-2959:


Separately, I'm wondering if it would be useful to add an alternative Tika 
parser that relies on tika-server or a modified version of a pipes-parser.  
This would put all of the Tika dependencies and jar hell in its own process, 
and we wouldn't have to load any dependencies aside from tika-core into Nutch's 
JVM.

 

They're working on doing this over on Solr now as well (I think they've chosen 
the tika-server route).
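
A minimal sketch of what delegating parsing to a tika-server could look like from the client side, using only the JDK HTTP client so that no Tika parser jars are needed in the calling JVM. The class name `TikaServerClientSketch` is hypothetical and not part of Nutch or Tika; the endpoint assumes a tika-server running locally on its default port 9998 and accepting documents via PUT on `/tika`.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration; not part of Nutch or Tika.
final class TikaServerClientSketch {

  // Assumption: a tika-server instance listening locally on its default port.
  static final URI DEFAULT_ENDPOINT = URI.create("http://localhost:9998/tika");

  // Builds a PUT request that carries the raw document bytes and asks for plain text.
  static HttpRequest buildRequest(URI endpoint, byte[] content) {
    return HttpRequest.newBuilder(endpoint)
        .header("Accept", "text/plain")
        .PUT(HttpRequest.BodyPublishers.ofByteArray(content))
        .build();
  }

  // Sends the document and returns the extracted text; fails on any non-200 status.
  static String extractText(HttpClient client, URI endpoint, byte[] content)
      throws IOException, InterruptedException {
    HttpResponse<String> response =
        client.send(buildRequest(endpoint, content), HttpResponse.BodyHandlers.ofString());
    if (response.statusCode() != 200) {
      throw new IOException("tika-server returned HTTP " + response.statusCode());
    }
    return response.body();
  }

  // Usage: java TikaServerClientSketch <file>
  public static void main(String[] args) throws Exception {
    byte[] bytes = Files.readAllBytes(Path.of(args[0]));
    System.out.println(extractText(HttpClient.newHttpClient(), DEFAULT_ENDPOINT, bytes));
  }
}
```

A real Nutch parser plugin would additionally need to map the response onto Nutch's parse result and handle timeouts and retries, which this sketch leaves out.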



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2023-05-24 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17725805#comment-17725805
 ] 

Tim Allison commented on NUTCH-2959:


I just opened a PR to upgrade Tika to 2.8.0 on ANY23: 
https://issues.apache.org/jira/browse/ANY23-610 -> 
[https://github.com/apache/any23/pull/320] 

Let's see if we can get buy-in and maybe another release of any23?



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-21 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582534#comment-17582534
 ] 

Sebastian Nagel commented on NUTCH-2959:


Hi [~markus17], moving this to 1.20: I can reproduce the issue with the failing 
unit test TestRobotsMetaProcessor, likely caused by an incompatibility between 
Tika 2.3.0 (used by any23) and 2.4.1 (used here by parse-tika and in core).



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-10 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577845#comment-17577845
 ] 

Sebastian Nagel commented on NUTCH-2959:


Hi [~markus17],

regarding the error with the javax.ws.rs dependency: this issue had a long 
history before it was fixed (cf. NUTCH-2669, NUTCH-2697, IVY-1586). I remember it 
was painful to get a clean system: delete ~/.ivy2/ and make sure that no ivy 
jar older than 2.5.0 is used and writes to ~/.ivy2/. This rules out building 
older versions of Nutch, and also other projects built with ant/ivy. An older 
version of ivy could also be requested and downloaded by a Nutch plugin: check 
for the properties ivy.version or ivy.installversion, and also whether ivy jars 
happen to be installed somewhere on the system (e.g. ~/.ivy2/lib/).
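
The check for stray ivy jars could be sketched roughly as follows. This is only an illustration: the class `OldIvyJarFinder` is hypothetical, and it assumes the jars follow the naming scheme `ivy-<major>.<minor>.<patch>.jar` and live in ~/.ivy2/lib/; jars installed elsewhere on the system would need the same check pointed at their directory.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch or Ivy.
final class OldIvyJarFinder {

  // Assumed naming scheme: ivy-<major>.<minor>.<patch>.jar
  private static final Pattern IVY_JAR =
      Pattern.compile("ivy-(\\d+)\\.(\\d+)\\.(\\d+)\\.jar");

  // True if fileName looks like an ivy jar whose version is below 2.5.0.
  static boolean isOlderThan250(String fileName) {
    Matcher m = IVY_JAR.matcher(fileName);
    if (!m.matches()) {
      return false; // not an ivy jar in the assumed naming scheme
    }
    int major = Integer.parseInt(m.group(1));
    int minor = Integer.parseInt(m.group(2));
    return major < 2 || (major == 2 && minor < 5);
  }

  // Lists ivy jars older than 2.5.0 in the given directory (empty if it does not exist).
  static List<Path> findOldJars(Path dir) throws IOException {
    List<Path> old = new ArrayList<>();
    if (!Files.isDirectory(dir)) {
      return old;
    }
    try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "ivy-*.jar")) {
      for (Path jar : jars) {
        if (isOlderThan250(jar.getFileName().toString())) {
          old.add(jar);
        }
      }
    }
    return old;
  }

  public static void main(String[] args) throws IOException {
    Path lib = Path.of(System.getProperty("user.home"), ".ivy2", "lib");
    findOldJars(lib).forEach(p -> System.out.println("older than 2.5.0: " + p));
  }
}
```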

While trying to upgrade to 2.4.0 (NUTCH-2948) I also ran into a test 
failure, probably due to conflicting dependencies:
- tika-core 2.4.0 required by Nutch core (ivy/ivy.xml)
- any23 requiring tika-parser 2.3.0
- parse-tika requiring tika-parser 2.4.0

In the past there were no issues as any23 includes tika-core. But we may now 
need to exclude or override some dependencies in any23.
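
One way to make such a conflict visible is to group the Tika jars on a classpath by version. The following is a rough sketch under stated assumptions: the class `TikaVersionCheck` is hypothetical, jar names are assumed to follow `tika-<artifact>-<version>.jar`, and the jar names in `main` are made-up examples rather than the actual contents of a Nutch build.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch, any23 or Tika.
final class TikaVersionCheck {

  // Assumed naming scheme: tika-<artifact>-<version>.jar
  private static final Pattern TIKA_JAR =
      Pattern.compile("(tika-[a-z0-9-]+?)-(\\d+(?:\\.\\d+)*)\\.jar");

  // Maps each Tika version found to the artifact names that use it.
  static Map<String, Set<String>> artifactsByVersion(List<String> jarNames) {
    Map<String, Set<String>> byVersion = new TreeMap<>();
    for (String name : jarNames) {
      Matcher m = TIKA_JAR.matcher(name);
      if (m.matches()) {
        byVersion.computeIfAbsent(m.group(2), v -> new TreeSet<>()).add(m.group(1));
      }
    }
    return byVersion;
  }

  public static void main(String[] args) {
    // Made-up example jar names mirroring the 2.3.0 vs. 2.4.0 mix described above.
    List<String> jars = List.of(
        "tika-core-2.4.0.jar",
        "tika-parsers-standard-package-2.3.0.jar",
        "tika-parsers-standard-package-2.4.0.jar",
        "commons-io-2.11.0.jar");
    Map<String, Set<String>> byVersion = artifactsByVersion(jars);
    if (byVersion.size() > 1) {
      System.out.println("Mixed Tika versions: " + byVersion);
    }
  }
}
```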

> Upgrade to Apache Tika 2.4.1
> 
>
> Key: NUTCH-2959
> URL: https://issues.apache.org/jira/browse/NUTCH-2959
> Project: Nutch
>  Issue Type: Task
>Reporter: Markus Jelsma
>Priority: Major
> Fix For: 1.19
>
> Attachments: NUTCH-2959.patch
>
>






[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577447#comment-17577447
 ] 

Markus Jelsma commented on NUTCH-2959:
--

Nice, thanks to NUTCH-2669 I can get past the issue by using:
ant -Dpackaging.type=jar clean runtime test


Everything now builds, except that I am stopped by the indexer-elastic plugin; 
it is the same error that I ran into some time ago.

 
{code:java}
    [javac] /home/markus/projects/apache/nutch/nutch/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: error: package org.apache.http.impl.nio.client does not exist
    [javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
{code}
I disabled the plugin; the tests seem to pass except for 
TestRobotsMetaProcessor. It complains about Any23.



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.4.1

2022-08-09 Thread Markus Jelsma (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17577420#comment-17577420
 ] 

Markus Jelsma commented on NUTCH-2959:
--

Here's a patch. This patch does not include the change in plugin.xml for any23. 
It is also untested because, for some reason, I once again cannot build Nutch 
:(
{code:java}
[ivy:resolve]   [FAILED ] javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}:  (0ms)
[ivy:resolve]    local: tried
[ivy:resolve]     /home/markus/.ivy2/local/javax.ws.rs/javax.ws.rs-api/2.1.1/${packaging.type}s/javax.ws.rs-api.${packaging.type}
[ivy:resolve]    maven2: tried
[ivy:resolve]     https://repo1.maven.org/maven2/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve]    apache-snapshot: tried
[ivy:resolve]     https://repository.apache.org/content/repositories/snapshots/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve]    sonatype: tried
[ivy:resolve]     https://oss.sonatype.org/content/repositories/releases/javax/ws/rs/javax.ws.rs-api/2.1.1/javax.ws.rs-api-2.1.1.${packaging.type}
[ivy:resolve]   ::
[ivy:resolve]   ::  FAILED DOWNLOADS    ::
[ivy:resolve]   :: ^ see resolution messages for details  ^ ::
[ivy:resolve]   ::
[ivy:resolve]   :: javax.ws.rs#javax.ws.rs-api;2.1.1!javax.ws.rs-api.${packaging.type}
[ivy:resolve]   ::
{code}
I cleared my Ivy cache and created a clean checkout. Some other build error 
mysteriously resolved itself, and now we see this one. I haven't seen this 
error in a long time.
