[jira] [Commented] (NUTCH-3020) ParseSegment should check for protocol's flags for truncation

2023-11-06 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783365#comment-17783365
 ] 

Hudson commented on NUTCH-3020:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #140 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/140/])
NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag (#794) 
(github: 
[https://github.com/apache/nutch/commit/90849124d757fb0417ea90576e88b1f55da616f1])
* (edit) src/java/org/apache/nutch/parse/ParseSegment.java
* (add) src/test/org/apache/nutch/parse/TestParseSegment.java


> ParseSegment should check for protocol's flags for truncation
> -
>
> Key: NUTCH-3020
> URL: https://issues.apache.org/jira/browse/NUTCH-3020
> Project: Nutch
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.20
>
>
> As discussed on the user list, several protocols can identify when a fetch 
> has been truncated. ParseSegment only compares the number of bytes fetched 
> against the HTTP Content-Length header (if it exists). We should modify 
> ParseSegment to also check for truncation notifications from the protocols.
> I noticed this specifically with okhttp, but other protocols may flag 
> truncation as well.
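
The check described in the issue can be sketched roughly as follows. This is a minimal illustration, not the code merged in PR #794: the class name, the method signature, and both metadata keys are assumptions made for the example, and the fetch metadata is modelled as a plain `Map` rather than Nutch's own classes.

```java
import java.util.Map;

/** Sketch of a two-step truncation check; every name here is hypothetical. */
final class TruncationCheckSketch {

  // Assumed metadata keys, chosen for illustration only.
  static final String CONTENT_LENGTH = "Content-Length";
  static final String TRUNCATED_FLAG = "http.content.truncated";

  /**
   * Returns true if the protocol flagged the fetch as truncated, or if
   * fewer bytes were fetched than the length header announced.
   */
  static boolean isTruncated(Map<String, String> metadata, byte[] content) {
    // Step 1: honour an explicit truncation flag set by the protocol plugin.
    if (Boolean.parseBoolean(metadata.get(TRUNCATED_FLAG))) {
      return true;
    }
    // Step 2: fall back to comparing fetched bytes with the length header.
    String header = metadata.get(CONTENT_LENGTH);
    if (header == null) {
      return false; // no header, nothing to compare against
    }
    try {
      long declared = Long.parseLong(header.trim());
      return content.length < declared;
    } catch (NumberFormatException e) {
      return false; // unparseable header: treat as unknown, not as truncated
    }
  }
}
```

Checking the flag first means a protocol that knows it cut the fetch short is believed even when the server sent no length header at all, which is the case the header comparison alone cannot catch.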



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Jenkins build is back to normal : Nutch » Nutch-trunk #140

2023-11-06 Thread Apache Jenkins Server
See 




[jira] [Resolved] (NUTCH-3020) ParseSegment should check for protocol's flags for truncation

2023-11-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3020.

Fix Version/s: 1.20
   Resolution: Fixed



[jira] [Commented] (NUTCH-3020) ParseSegment should check for protocol's flags for truncation

2023-11-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3020?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783360#comment-17783360
 ] 

ASF GitHub Bot commented on NUTCH-3020:
---

tballison merged PR #794:
URL: https://github.com/apache/nutch/pull/794






Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-06 Thread via GitHub


tballison merged PR #794:
URL: https://github.com/apache/nutch/pull/794


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783352#comment-17783352
 ] 

Tim Allison commented on NUTCH-3019:


{noformat}
[junit] Tests run: 7, Failures: 4, Errors: 0, Skipped: 0, Time elapsed: 2.271 sec
[junit] Test org.apache.nutch.protocol.httpclient.TestProtocolHttpClient FAILED
{noformat}

???

> Upgrade to Apache Tika 2.9.1
> 
>
> Key: NUTCH-3019
> URL: https://issues.apache.org/jira/browse/NUTCH-3019
> Project: Nutch
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Minor
> Fix For: 1.20
>
>
> There's a commons-compress CVE that affects Tika 2.9.0 (CVE-2023-42503). This 
> is fixed by an upgrade in Tika 2.9.1.





[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783327#comment-17783327
 ] 

Hudson commented on NUTCH-3019:
---

FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #139 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/139/])
NUTCH-3019 -- update Tika (#797) (github: 
[https://github.com/apache/nutch/commit/f88b9a116d6be5eea738d99af65406bdd96fd6d0])
* (edit) src/plugin/parse-tika/ivy.xml
* (edit) src/plugin/parse-tika/plugin.xml
* (edit) src/plugin/language-identifier/plugin.xml
* (edit) src/plugin/language-identifier/ivy.xml
* (edit) ivy/ivy.xml




Build failed in Jenkins: Nutch » Nutch-trunk #139

2023-11-06 Thread Apache Jenkins Server
See 


Changes:

[github] NUTCH-3019 -- update Tika (#797)


--
[...truncated 760.61 KB...]
resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.026 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.074 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.147 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-protocol

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 

[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-protocol
[junit] Running 
org.apache.nutch.net.urlnormalizer.protocol.TestProtocolURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.951 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-querystring
[junit] Running 
org.apache.nutch.net.urlnormalizer.querystring.TestQuerystringURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.398 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-regex
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.976 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-slash

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:


[jira] [Resolved] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison resolved NUTCH-3019.

Fix Version/s: 1.20
   Resolution: Fixed



[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783290#comment-17783290
 ] 

ASF GitHub Bot commented on NUTCH-3019:
---

tballison merged PR #797:
URL: https://github.com/apache/nutch/pull/797






Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison merged PR #797:
URL: https://github.com/apache/nutch/pull/797





[jira] [Comment Edited] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783254#comment-17783254
 ] 

Tim Allison edited comment on NUTCH-3019 at 11/6/23 3:46 PM:
-

tballison commented on PR #797:
URL: [https://github.com/apache/nutch/pull/797#issuecomment-1795161171]

```

2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec
2023-11-06T15:02:48.2192793Z [junit] Test org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED

```


was (Author: githubbot):
tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171

   ```2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec
   2023-11-06T15:02:48.2192793Z [junit] Test org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED```






[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783254#comment-17783254
 ] 

ASF GitHub Bot commented on NUTCH-3019:
---

tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171

   ```2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec
   2023-11-06T15:02:48.2192793Z [junit] Test org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED```






Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171

   ```2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec
   2023-11-06T15:02:48.2192793Z [junit] Test org.apache.nutch.protocol.okhttp.TestBadServerResponses FAILED```





[jira] [Comment Edited] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783252#comment-17783252
 ] 

Tim Allison edited comment on NUTCH-3019 at 11/6/23 3:32 PM:
-

I just got this, which tracks with the upgrade to 2.9.0.
{noformat}
 ParserStatus
        failed=84
        success=625{noformat}
https://issues.apache.org/jira/browse/NUTCH-2959?focusedCommentId=17771490&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17771490


was (Author: talli...@mitre.org):
ParserStatus
        failed=84
        success=625



[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783252#comment-17783252
 ] 

Tim Allison commented on NUTCH-3019:


ParserStatus
        failed=84
        success=625



[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783228#comment-17783228
 ] 

ASF GitHub Bot commented on NUTCH-3019:
---

tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1794934171

   Need to keep this as a draft until the 2.9.1.0 shim actually lands in Maven Central.






Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison commented on PR #797:
URL: https://github.com/apache/nutch/pull/797#issuecomment-1794934171

   Need to keep this as a draft until the 2.9.1.0 shim actually lands in Maven Central.





[jira] [Commented] (NUTCH-3019) Upgrade to Apache Tika 2.9.1

2023-11-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3019?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783227#comment-17783227
 ] 

ASF GitHub Bot commented on NUTCH-3019:
---

tballison opened a new pull request, #797:
URL: https://github.com/apache/nutch/pull/797

   Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated!

   Before opening the pull request, please verify that
   * there is an open issue on the [Nutch issue tracker](https://issues.apache.org/jira/projects/NUTCH) which describes the problem or the improvement. We cannot accept pull requests without an issue because the change wouldn't be listed in the release notes.
   * the issue ID (`NUTCH-`)
     - is referenced in the title of the pull request
     - and placed in front of your commit messages surrounded by square brackets (`[NUTCH-] Issue or pull request title`)
   * commits are squashed into a single one (or few commits for larger changes)
   * Java source code follows [Nutch Eclipse Code Formatting rules](https://github.com/apache/nutch/blob/master/eclipse-codeformat.xml)
   * Nutch is successfully built and unit tests pass by running `ant clean runtime test`
   * there should be no conflicts when merging the pull request branch into the *recent* master branch. If there are conflicts, please try to rebase the pull request branch on top of a freshly pulled master branch.
   * if new dependencies are added,
     - are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](https://www.apache.org/legal/resolved.html#category-a)?
     - are `LICENSE-binary` and `NOTICE-binary` updated accordingly?

   We will be able to faster integrate your pull request if these conditions are met. If you have any questions how to fix your problem or about using Nutch in general, please sign up for the [Nutch mailing list](https://nutch.apache.org/mailing_lists.html). Thanks!
   






[PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub


tballison opened a new pull request, #797:
URL: https://github.com/apache/nutch/pull/797




[jira] [Commented] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-06 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17783226#comment-17783226
 ] 

ASF GitHub Bot commented on NUTCH-3025:
---

jnioche opened a new pull request, #796:
URL: https://github.com/apache/nutch/pull/796

   Checks the length of the query and path elements of a URL. 
   see https://issues.apache.org/jira/browse/NUTCH-3025




> urlfilter-fast to filter based on the length of the URL
> ---
>
> Key: NUTCH-3025
> URL: https://issues.apache.org/jira/browse/NUTCH-3025
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Julien Nioche
> Priority: Major
> Fix For: 1.20
>
>
> There is currently no filter implementation to remove URLs based on their 
> length or the length of their path / query.
> Doing so with the regex filter would be inefficient; instead, we could 
> implement it in _urlfilter-fast_.





[jira] [Created] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-06 Thread Julien Nioche (Jira)
Julien Nioche created NUTCH-3025:


         Summary: urlfilter-fast to filter based on the length of the URL
             Key: NUTCH-3025
             URL: https://issues.apache.org/jira/browse/NUTCH-3025
         Project: Nutch
      Issue Type: Improvement
Affects Versions: 1.19
        Reporter: Julien Nioche
         Fix For: 1.20


There is currently no filter implementation to remove URLs based on their 
length or the length of their path / query.
Doing so with the regex filter would be inefficient; instead, we could implement 
it in _urlfilter-fast_.
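
As a rough illustration of the proposal, a length-based filter could look like the sketch below. It is not the code from PR #796: the class name, constructor parameters, and limits are hypothetical, and it is written as a standalone class rather than as a rule inside urlfilter-fast. It returns the URL when accepted and `null` when rejected, mirroring the usual URL filter convention.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Sketch of a length-based URL filter; names and limits are hypothetical. */
final class UrlLengthFilterSketch {

  private final int maxTotal; // maximum length of the whole URL string
  private final int maxPath;  // maximum length of the raw path component
  private final int maxQuery; // maximum length of the raw query component

  UrlLengthFilterSketch(int maxTotal, int maxPath, int maxQuery) {
    this.maxTotal = maxTotal;
    this.maxPath = maxPath;
    this.maxQuery = maxQuery;
  }

  /** Returns the URL unchanged if it passes all limits, or null to reject it. */
  String filter(String url) {
    // Cheapest check first: the overall length needs no parsing at all.
    if (url.length() > maxTotal) {
      return null;
    }
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return null; // this sketch simply rejects URLs it cannot parse
    }
    String path = uri.getRawPath();
    String query = uri.getRawQuery();
    if (path != null && path.length() > maxPath) {
      return null;
    }
    if (query != null && query.length() > maxQuery) {
      return null;
    }
    return url;
  }
}
```

Plain length comparisons like these cost a few integer checks per URL, which is the efficiency argument for not expressing the same limits as regular expressions.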


