[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769299#comment-17769299
 ] 

ASF GitHub Bot commented on NUTCH-2959:
---

sebastian-nagel commented on PR #776:
URL: https://github.com/apache/nutch/pull/776#issuecomment-1736008780

   > I suggest that we downgrade to Tika 2.2.1 to fix that regression.
   
   Good point, @lewismc. I've opened NUTCH-3006 for that.
   




> Upgrade to Apache Tika 2.9.0
> 
>
>                 Key: NUTCH-2959
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2959
>             Project: Nutch
>          Issue Type: Task
>    Affects Versions: 1.19
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: NUTCH-2959.patch
>
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-26 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3006:
--

             Summary: Downgrade Tika dependency to 2.2.1 (core and parse-tika)
                 Key: NUTCH-3006
                 URL: https://issues.apache.org/jira/browse/NUTCH-3006
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.20
            Reporter: Sebastian Nagel
             Fix For: 1.20


Tika 2.3.0 and later depend on commons-io 2.11.0 (or higher), which is 
not available when Nutch runs on Hadoop. Hadoop 3.4.0 is expected to be the 
first release to ship with commons-io 2.11.0 (HADOOP-18301); all currently 
released versions provide commons-io 2.8.0. Because Hadoop-required 
dependencies are enforced in (pseudo-)distributed mode, using Tika 2.3.0 or 
later may cause issues; see NUTCH-2937 and NUTCH-2959.

[~lewismc] suggested in the discussion of [GitHub PR 
#776|https://github.com/apache/nutch/pull/776] to downgrade to Tika 2.2.1 to 
resolve these issues until Hadoop 3.4.0 becomes available.
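
When debugging this kind of dependency conflict, it can help to print which jar a class was actually loaded from inside the running job. The following is only a minimal sketch: `ClasspathOrigin` and `describe` are hypothetical helper names, not part of Nutch or Hadoop, and the reported version is whatever the jar manifest declares (it may be missing).

```java
import java.security.CodeSource;

/** Hypothetical helper (not part of Nutch): reports where a class was loaded from. */
final class ClasspathOrigin {

  /** Returns "<class name> <- <jar or directory> (version <x>)". */
  static String describe(Class<?> clazz) {
    CodeSource source = clazz.getProtectionDomain().getCodeSource();
    // Classes loaded by the bootstrap loader (JDK classes) have no code source.
    String location = (source == null || source.getLocation() == null)
        ? "bootstrap/JDK" : source.getLocation().toString();
    Package pkg = clazz.getPackage();
    // The version is read from the jar manifest and is often absent.
    String version = (pkg == null) ? null : pkg.getImplementationVersion();
    return clazz.getName() + " <- " + location
        + " (version " + (version == null ? "unknown" : version) + ")";
  }

  /** Same as describe(Class), but resolves the class by name and never throws. */
  static String describe(String className) {
    try {
      return describe(Class.forName(className));
    } catch (ClassNotFoundException | LinkageError e) {
      return className + " <- not on classpath";
    }
  }
}
```

Calling `ClasspathOrigin.describe("org.apache.commons.io.IOUtils")` from within a job would then show whether the Hadoop-provided commons-io jar or the one bundled with Nutch is in effect.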





[GitHub] [nutch] sebastian-nagel commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-26 Thread via GitHub


sebastian-nagel commented on PR #776:
URL: https://github.com/apache/nutch/pull/776#issuecomment-1736008780

   > I suggest that we downgrade to Tika 2.2.1 to fix that regression.
   
   Good point, @lewismc. I've opened NUTCH-3006 for that.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769293#comment-17769293
 ] 

ASF GitHub Bot commented on NUTCH-2990:
---

sebastian-nagel commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1735968193

   >  an example on hand of a robots.txt which can be fetched with >1 redirects?
   
   http://wikipedia.org/robots.txt
   
   Note: this works with protocol-http; for protocol-okhttp, the fix for 
NUTCH-3002 also needs to be applied.
   
   Maybe as an additional note: this PR removes the secondary lookup for a 
lower-cased "location" header. Case-insensitive lookup of protocol metadata 
should be implemented at the protocol level.
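
As an illustration of what case-insensitive header lookup could look like, here is a minimal sketch. `CaseInsensitiveHeaders` is a hypothetical stand-in, not Nutch's metadata class, and it keeps only one value per header name for brevity.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical stand-in for protocol-level response metadata (not a Nutch class). */
final class CaseInsensitiveHeaders {
  // With CASE_INSENSITIVE_ORDER, "Location" and "location" map to the same key.
  private final Map<String, String> headers =
      new TreeMap<>(String.CASE_INSENSITIVE_ORDER);

  /** Stores a header; a later value for the same name (in any case) replaces it. */
  void set(String name, String value) {
    headers.put(name, value);
  }

  /** Returns the value regardless of the case the server used, or null if absent. */
  String get(String name) {
    return headers.get(name);
  }
}
```

With such a container, a caller asks for `get("Location")` once and does not need a second lookup for the lower-cased name.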




> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> ---
>
>                 Key: NUTCH-2990
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2990
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol, robots
>    Affects Versions: 1.19
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Major
>             Fix For: 1.20
>
>
> The robots.txt parser 
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
>  follows only one redirect when fetching the robots.txt, while the robots.txt 
> RFC 9309 recommends following at least five redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a 
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers 
> SHOULD follow at least five consecutive redirects, even across authorities 
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the 
> robots.txt file MUST be fetched, parsed, and its rules followed in the 
> context of the initial authority. If there are more than five consecutive 
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect 
> location is itself a "/robots.txt" on a different host and then try to read 
> it from the cache.
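
The redirect rule quoted above can be sketched as a small loop. This is not the code from the pull request: `RobotsRedirects`, `Response` and the `fetcher` callback are hypothetical names, the fetcher is assumed never to return null, and the cache lookup for redirect targets on other hosts is left out.

```java
import java.net.URI;
import java.util.function.Function;

/** Hypothetical sketch of robots.txt redirect handling per RFC 9309, section 2.3.1.2. */
final class RobotsRedirects {
  static final int MAX_REDIRECTS = 5;

  /** Minimal response model: status code plus optional Location header value. */
  static final class Response {
    final int status;
    final String location;

    Response(int status, String location) {
      this.status = status;
      this.location = location;
    }

    boolean isRedirect() {
      boolean redirectStatus = status == 301 || status == 302 || status == 303
          || status == 307 || status == 308;
      return redirectStatus && location != null;
    }
  }

  /**
   * Follows at most MAX_REDIRECTS consecutive redirects starting at robotsUrl.
   * Returns the first non-redirect response, or null if the chain is longer,
   * in which case the crawler may treat the robots.txt as unavailable.
   */
  static Response fetch(String robotsUrl, Function<String, Response> fetcher) {
    String url = robotsUrl;
    for (int followed = 0; ; followed++) {
      Response response = fetcher.apply(url);
      if (!response.isRedirect()) {
        return response;
      }
      if (followed == MAX_REDIRECTS) {
        return null; // a sixth consecutive redirect: give up
      }
      // Resolve relative Location values against the current URL.
      url = URI.create(url).resolve(response.location).toString();
    }
  }
}
```

Whatever rules are parsed from the final response would still be applied in the context of the initial host, as the RFC requires.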





[GitHub] [nutch] sebastian-nagel commented on pull request #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub


sebastian-nagel commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1735968193

   >  an example on hand of a robots.txt which can be fetched with >1 redirects?
   
   http://wikipedia.org/robots.txt
   
   Note: this works with protocol-http; for protocol-okhttp, the fix for 
NUTCH-3002 also needs to be applied.
   
   Maybe as an additional note: this PR removes the secondary lookup for a 
lower-cased "location" header. Case-insensitive lookup of protocol metadata 
should be implemented at the protocol level.





Establishing a Nutch development roadmap

2023-09-26 Thread lewis john mcgibbney
Hi dev@,

I've been at arm's length for a while as $dayjob changed and then
changed again over the last number of years.

With that being said, I wanted to start a thread on $title with the
goal of establishing some "big items" we could put on the roadmap and
maybe even publish...

Here are some of the things I've been thinking about (unordered):

* NUTCH-2940 Develop Gradle Core Build for Apache Nutch
* Metrics system integration cf. https://github.com/apache/nutch/pull/712
* Upgrading Javac version > 11
* Trade study to consider integrating (something like) Plugin
Framework for Java (PF4J) into Nutch
* porting Nutch to run on Apache Beam https://beam.apache.org/

Does anyone else have candidates they wish to add?

Thanks for your consideration.

lewismc


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769255#comment-17769255
 ] 

ASF GitHub Bot commented on NUTCH-2990:
---

lewismc commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1735761972

   Very nice @sebastian-nagel 
   Do you have an example on hand of a robots.txt which can be fetched with >1 
redirects?






[GitHub] [nutch] lewismc commented on pull request #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub


lewismc commented on PR #779:
URL: https://github.com/apache/nutch/pull/779#issuecomment-1735761972

   Very nice @sebastian-nagel 
   Do you have an example on hand of a robots.txt which can be fetched with >1 
redirects?





[jira] [Created] (NUTCH-3005) Upgrade selenium as needed

2023-09-26 Thread Tim Allison (Jira)
Tim Allison created NUTCH-3005:
--

             Summary: Upgrade selenium as needed
                 Key: NUTCH-3005
                 URL: https://issues.apache.org/jira/browse/NUTCH-3005
             Project: Nutch
          Issue Type: Improvement
            Reporter: Tim Allison


When we choose to upgrade Selenium, we should take note of this blog post 
about changes in headless Chromium: 
https://www.selenium.dev/blog/2023/headless-is-going-away/

ChromeOptions options = new ChromeOptions();
options.addArguments("--headless=new");
WebDriver driver = new ChromeDriver(options);
driver.get("https://selenium.dev");
driver.quit();





[jira] [Commented] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769168#comment-17769168
 ] 

Hudson commented on NUTCH-3004:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #115 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/115/])
NUTCH-3004 -- propagate ssl exception if message doesn't match "handshake 
alert..." (tallison: 
[https://github.com/apache/nutch/commit/5be64d2dad755f55980a1ea767abfb8e9fcc808a])
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> Avoid NPE in HttpResponse
> -
>
>                 Key: NUTCH-3004
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3004
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, protocol
>    Affects Versions: 1.19
>            Reporter: Tim Allison
>            Priority: Trivial
>             Fix For: 1.20
>
>
> I recently deployed Nutch on a FIPS-enabled RHEL 8 instance, and I got an NPE 
> in HttpResponse.  When I set the log level to debug, I could see what was 
> happening, but it would have been better to get a meaningful exception rather 
> than an NPE.
> The issue is that in the catch clause, the exception is propagated only if 
> the message is "handshake alert..." and then the reconnect fails.  If the 
> message is not that, then the SSL socket remains null, and we get an NPE 
> below the source I quote here.
> I think we should throw the same HttpException that we throw in the nested 
> try if the message is not "handshake alert..."
> {code:java}
>     try {
>       sslsocket = getSSLSocket(socket, sockHost, sockPort);
>       sslsocket.startHandshake();
>     } catch (Exception e) {
>       Http.LOG.debug("SSL connection to {} failed with: {}", url,
>           e.getMessage());
>       if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
>         try {
>           // Reconnect, see NUTCH-2447
>           socket = new Socket();
>           socket.setSoTimeout(http.getTimeout());
>           socket.connect(sockAddr, http.getTimeout());
>           sslsocket = getSSLSocket(socket, "", sockPort);
>           sslsocket.startHandshake();
>         } catch (Exception ex) {
>           String msg = "SSL reconnect to " + url + " failed with: "
>               + e.getMessage();
>           throw new HttpException(msg);
>         }
>       }
>     }
>     socket = sslsocket;
>   }
> {code}
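
The proposed behaviour can be sketched in isolation as follows. This is not the merged patch: `SslConnectSketch`, its nested `HttpException` and the `Handshake` callback are hypothetical stand-ins for the Nutch classes and the socket setup, and unlike the quoted source this sketch reports the reconnect exception (`ex`) in the second message.

```java
/** Hypothetical, simplified model of the control flow discussed above (not the Nutch source). */
final class SslConnectSketch {

  /** Stand-in for Nutch's HttpException. */
  static final class HttpException extends Exception {
    HttpException(String msg, Throwable cause) {
      super(msg, cause);
    }
  }

  /** Stand-in for "create the SSL socket and start the handshake". */
  interface Handshake {
    Object connect(String sniHost) throws Exception;
  }

  /** Returns the connected socket handle or throws; it never returns null on failure. */
  static Object connect(String url, String host, Handshake handshake)
      throws HttpException {
    try {
      return handshake.connect(host);
    } catch (Exception e) {
      if (!"handshake alert:  unrecognized_name".equals(e.getMessage())) {
        // Any other failure: propagate it instead of leaving the socket null.
        throw new HttpException(
            "SSL connection to " + url + " failed with: " + e.getMessage(), e);
      }
      try {
        // Reconnect without a host name, see NUTCH-2447.
        return handshake.connect("");
      } catch (Exception ex) {
        throw new HttpException(
            "SSL reconnect to " + url + " failed with: " + ex.getMessage(), ex);
      }
    }
  }
}
```

The point of the design is that every failure path ends in an exception with a readable message, so the caller never dereferences a null socket.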





Jenkins build is back to normal : Nutch » Nutch-trunk #115

2023-09-26 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Nutch » Nutch-trunk #114

2023-09-26 Thread Apache Jenkins Server
See 


Changes:


--
Started by an SCM change
Running as SYSTEM
[EnvInject] - Loading node environment variables.
Building remotely on builds58 (ubuntu) in workspace 

The recommended git tool is: NONE
No credentials specified
 > git rev-parse --resolve-git-dir 
 >  # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git --version # 'git version 2.17.1'
 > git fetch --tags --progress -- https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/* # timeout=10
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:1003)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1245)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1309)
at hudson.scm.SCM.checkout(SCM.java:540)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1240)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:649)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:85)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:521)
at hudson.model.Run.execute(Run.java:1900)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
at hudson.model.ResourceController.execute(ResourceController.java:101)
at hudson.model.Executor.run(Executor.java:442)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress -- https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: remote: Enumerating objects: 12041, done.
remote: Counting objects:   0% (1/2802)
[...]
remote: Counting objects:  56% (1570/2802)

Build failed in Jenkins: Nutch » Nutch-trunk #113

2023-09-26 Thread Apache Jenkins Server
See 


Changes:


--
Started by an SCM change
Running as SYSTEM
[EnvInject] - Loading node environment variables.
Building remotely on builds58 (ubuntu) in workspace 

The recommended git tool is: NONE
No credentials specified
 > git rev-parse --resolve-git-dir 
 >  # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git --version # 'git version 2.17.1'
 > git fetch --tags --progress -- https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/* # timeout=10
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:1003)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1245)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1309)
at hudson.scm.SCM.checkout(SCM.java:540)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1240)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:649)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:85)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:521)
at hudson.model.Run.execute(Run.java:1900)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:44)
at hudson.model.ResourceController.execute(ResourceController.java:101)
at hudson.model.Executor.run(Executor.java:442)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress -- https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: remote: Enumerating objects: 12041, done.
remote: Counting objects:   0% (1/2802)
[...]
remote: Counting objects:  56% (1570/2802)

[jira] [Resolved] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3004.

Resolution: Fixed



[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769097#comment-17769097
 ] 

ASF GitHub Bot commented on NUTCH-2959:
---

tballison commented on PR #776:
URL: https://github.com/apache/nutch/pull/776#issuecomment-1735261857

   Converting this to draft until Hadoop 3.4.0 is released.






[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-26 Thread via GitHub


tballison commented on PR #776:
URL: https://github.com/apache/nutch/pull/776#issuecomment-1735261857

   Converting this to draft until Hadoop 3.4.0 is released.





[jira] [Commented] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769096#comment-17769096
 ] 

ASF GitHub Bot commented on NUTCH-3004:
---

tballison merged PR #778:
URL: https://github.com/apache/nutch/pull/778






[GitHub] [nutch] tballison merged pull request #778: NUTCH-3004

2023-09-26 Thread via GitHub


tballison merged PR #778:
URL: https://github.com/apache/nutch/pull/778





[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3004:
---
Fix Version/s: 1.20

> Avoid NPE in HttpResponse
> -
>
>                 Key: NUTCH-3004
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3004
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 1.19
>            Reporter: Tim Allison
>            Priority: Trivial
>             Fix For: 1.20


[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3004:
---
Component/s: plugin
 protocol



[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-3004:
---
Affects Version/s: 1.19

> Avoid NPE in HttpResponse
> -
>
> Key: NUTCH-3004
> URL: https://issues.apache.org/jira/browse/NUTCH-3004
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Tim Allison
> Priority: Trivial
>
> I recently deployed Nutch on a FIPS-enabled RHEL 8 instance and got an NPE 
> in HttpResponse.  Setting the log level to debug showed what was 
> happening, but a meaningful exception would have been better than an NPE.
> The issue is in the catch clause: an exception is thrown only if the 
> message is "handshake alert..." and the subsequent reconnect also fails.  
> For any other message the exception is swallowed, the SSL socket remains 
> null, and we get an NPE below the source I quote here.
> I think we should throw the same HttpException that we throw in the nested 
> try if the message is not "handshake alert..."
> {code:java}
> try {
>   sslsocket = getSSLSocket(socket, sockHost, sockPort);
>   sslsocket.startHandshake();
> } catch (Exception e) {
>   Http.LOG.debug("SSL connection to {} failed with: {}", url,
>       e.getMessage());
>   if ("handshake alert:  unrecognized_name".equals(e.getMessage())) {
>     try {
>       // Reconnect, see NUTCH-2447
>       socket = new Socket();
>       socket.setSoTimeout(http.getTimeout());
>       socket.connect(sockAddr, http.getTimeout());
>       sslsocket = getSSLSocket(socket, "", sockPort);
>       sslsocket.startHandshake();
>     } catch (Exception ex) {
>       String msg = "SSL reconnect to " + url + " failed with: "
>           + e.getMessage();
>       throw new HttpException(msg);
>     }
>   }
> }
> socket = sslsocket;
>   }
> {code}





[jira] [Commented] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread ASF GitHub Bot (Jira)


[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769045#comment-17769045 ]

ASF GitHub Bot commented on NUTCH-2990:
---

sebastian-nagel opened a new pull request, #779:
URL: https://github.com/apache/nutch/pull/779

   - follow multiple redirects when fetching robots.txt
   - number of followed redirects is configurable by the property `http.robots.redirect.max` (default: 5)
   - improvements in RobotRulesParser's robots.txt test utility
     - bug fix: the passed agent names need to be transferred to the property `http.robots.agents` earlier, before the protocol plugins are configured
     - more verbose debug logging
   
   




> HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309
> ---
>
> Key: NUTCH-2990
> URL: https://issues.apache.org/jira/browse/NUTCH-2990
> Project: Nutch
> Issue Type: Improvement
> Components: protocol, robots
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> The robots.txt parser 
> ([HttpRobotRulesParser|https://nutch.apache.org/documentation/javadoc/apidocs/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.html])
>  follows only one redirect when fetching the robots.txt, while the robots.txt 
> RFC 9309 recommends following at least five consecutive redirects:
> {quote} 2.3.1.2. Redirects
> It's possible that a server responds to a robots.txt fetch request with a 
> redirect, such as HTTP 301 or HTTP 302 in the case of HTTP. The crawlers 
> SHOULD follow at least five consecutive redirects, even across authorities 
> (for example, hosts in the case of HTTP).
> If a robots.txt file is reached within five consecutive redirects, the 
> robots.txt file MUST be fetched, parsed, and its rules followed in the 
> context of the initial authority. If there are more than five consecutive 
> redirects, crawlers MAY assume that the robots.txt file is unavailable.
> (https://datatracker.ietf.org/doc/html/rfc9309#name-redirects){quote}
> While following redirects, the parser should check whether the redirect 
> location is itself a "/robots.txt" on a different host and, if so, first try 
> to read it from the cache.
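
The redirect rule quoted from RFC 9309 can be sketched as a bounded loop. This is a minimal illustration under stated assumptions, not the HttpRobotRulesParser implementation: `RobotsRedirectSketch`, its `Response` type, and the `fetcher` function are hypothetical, and the cache lookup for redirect targets on other hosts is left out.

```java
import java.net.URI;
import java.util.function.Function;

/** Sketch only: not the Nutch HttpRobotRulesParser. */
final class RobotsRedirectSketch {

  /** Hypothetical minimal HTTP response. */
  static final class Response {
    final int status;
    final String location; // redirect target, null if absent
    final String body;     // robots.txt content, null if absent

    Response(int status, String location, String body) {
      this.status = status;
      this.location = location;
      this.body = body;
    }
  }

  /**
   * Follows at most maxRedirects consecutive redirects starting at start.
   * Returns the robots.txt content, or null if it is unavailable. The caller
   * applies the returned rules in the context of the initial authority.
   */
  static String fetchRobots(URI start, int maxRedirects, Function<URI, Response> fetcher) {
    URI current = start;
    for (int redirects = 0; ; redirects++) {
      Response r = fetcher.apply(current);
      boolean isRedirect = r.status >= 300 && r.status < 400 && r.location != null;
      if (!isRedirect) {
        return r.status == 200 ? r.body : null;
      }
      if (redirects >= maxRedirects) {
        return null; // too many consecutive redirects: treat as unavailable
      }
      // The Location header may be relative, so resolve it against the current URL.
      current = current.resolve(r.location);
    }
  }

  public static void main(String[] args) {
    Function<URI, Response> fetcher = u -> u.getHost().equals("a.example")
        ? new Response(301, "http://b.example/robots.txt", null)
        : new Response(200, null, "User-agent: *\nDisallow: /private/");
    System.out.println(fetchRobots(URI.create("http://a.example/robots.txt"), 5, fetcher));
  }
}
```

The limit is passed as a parameter so that it could be wired to a configurable value such as the `http.robots.redirect.max` property mentioned in the pull request.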





[GitHub] [nutch] sebastian-nagel opened a new pull request, #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub


sebastian-nagel opened a new pull request, #779:
URL: https://github.com/apache/nutch/pull/779

   - follow multiple redirects when fetching robots.txt
   - number of followed redirects is configurable by the property `http.robots.redirect.max` (default: 5)
   - improvements in RobotRulesParser's robots.txt test utility
     - bug fix: the passed agent names need to be transferred to the property `http.robots.agents` earlier, before the protocol plugins are configured
     - more verbose debug logging
   
   

