[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-15 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1721153679 Converting to draft to manually check for version conflicts. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the

[GitHub] [nutch] sebastian-nagel commented on pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-14 Thread via GitHub
sebastian-nagel commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1719977976 > I'll merge this in a day or so unless anyone has objections. Give me a few more days, over the weekend. I'd like to test it at least on a [pseudo-distributed Hadoop

[GitHub] [nutch] tballison opened a new pull request, #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-14 Thread via GitHub
tballison opened a new pull request, #776: URL: https://github.com/apache/nutch/pull/776 Thanks for your contribution to [Apache Nutch](https://nutch.apache.org/)! Your help is appreciated! Before opening the pull request, please verify that * there is an open issue on the [Nutch

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-14 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1719956656 I'll merge this in a day or so unless anyone has objections.

[GitHub] [nutch] tballison commented on pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-14 Thread via GitHub
tballison commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1719961162 I'll merge this in a day or so unless anyone has objections.

[GitHub] [nutch] tballison merged pull request #775: NUTCH-2998 -- Remove Any23 from Nutch

2023-09-14 Thread via GitHub
tballison merged PR #775: URL: https://github.com/apache/nutch/pull/775

[GitHub] [nutch] sebastian-nagel commented on pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-17 Thread via GitHub
sebastian-nagel commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1722472438 +1 A test with the [pseudo-distributed Hadoop setup](https://github.com/sebastian-nagel/nutch-test-single-node-cluster/) was successful: - Nutch tools work properly, no

[GitHub] [nutch] tballison commented on pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-17 Thread via GitHub
tballison commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1722508915 Fantastic! Thank you so much Sebastian!

[GitHub] [nutch] tballison merged pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-17 Thread via GitHub
tballison merged PR #772: URL: https://github.com/apache/nutch/pull/772

[GitHub] [nutch] tballison opened a new pull request, #769: NUTCH-2978 -- move to log4j2 logging throughout

2023-08-28 Thread via GitHub
tballison opened a new pull request, #769: URL: https://github.com/apache/nutch/pull/769

[GitHub] [nutch] tballison commented on pull request #769: NUTCH-2978 -- move to log4j2 logging throughout

2023-08-28 Thread via GitHub
tballison commented on PR #769: URL: https://github.com/apache/nutch/pull/769#issuecomment-1696087426 This is a draft. More work is required. Please help. :D

[GitHub] [nutch] tballison merged pull request #770: NUTCH-2999 Upgrade Lucene to latest 8.x version throughout

2023-08-30 Thread via GitHub
tballison merged PR #770: URL: https://github.com/apache/nutch/pull/770

[GitHub] [nutch] tballison closed pull request #769: NUTCH-2978 -- move to log4j2 logging throughout

2023-08-30 Thread via GitHub
tballison closed pull request #769: NUTCH-2978 -- move to log4j2 logging throughout URL: https://github.com/apache/nutch/pull/769

[GitHub] [nutch] tballison commented on pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-13 Thread via GitHub
tballison commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1717765669 If folks could test this out on their workloads, that'd be fantastic! It works on mine, but I'm really hesitant to merge until someone else runs it. Thank you!

[GitHub] [nutch] tballison opened a new pull request, #775: Remove Any23 from Nutch

2023-09-13 Thread via GitHub
tballison opened a new pull request, #775: URL: https://github.com/apache/nutch/pull/775

[GitHub] [nutch] tballison opened a new pull request, #773: NUTCH-3000 - the selenium protocol should return the full html, not just the inner body

2023-09-13 Thread via GitHub
tballison opened a new pull request, #773: URL: https://github.com/apache/nutch/pull/773 …ust the inner body element.

[GitHub] [nutch] tballison opened a new pull request, #774: NUTCH-3001 - fix logic for grabbing bytes if there's no content type …

2023-09-13 Thread via GitHub
tballison opened a new pull request, #774: URL: https://github.com/apache/nutch/pull/774 …in the header

[GitHub] [nutch] tballison commented on pull request #775: Remove Any23 from Nutch

2023-09-13 Thread via GitHub
tballison commented on PR #775: URL: https://github.com/apache/nutch/pull/775#issuecomment-1717820655 When I build this, I get this harmless (?) warning in `src/plugin/logs/hadoop.log`: ``` 2023-02-24 10:07:39,218 WARN o.a.n.p.PluginManifestParser [main] Error while loading

[GitHub] [nutch] tballison merged pull request #774: NUTCH-3001 - fix logic for grabbing bytes if there's no content type …

2023-09-13 Thread via GitHub
tballison merged PR #774: URL: https://github.com/apache/nutch/pull/774

[GitHub] [nutch] tballison merged pull request #773: NUTCH-3000 - the selenium protocol should return the full html, not just the inner body

2023-09-13 Thread via GitHub
tballison merged PR #773: URL: https://github.com/apache/nutch/pull/773

[GitHub] [nutch] tballison opened a new pull request, #770: NUTCH-2999 Upgrade Lucene to latest 8.x version throughout

2023-08-30 Thread via GitHub
tballison opened a new pull request, #770: URL: https://github.com/apache/nutch/pull/770

[GitHub] [nutch] tballison merged pull request #771: NUTCH-2999 fix for initial PR

2023-08-30 Thread via GitHub
tballison merged PR #771: URL: https://github.com/apache/nutch/pull/771

[GitHub] [nutch] tballison opened a new pull request, #771: NUTCH-2999 fix for initial PR

2023-08-30 Thread via GitHub
tballison opened a new pull request, #771: URL: https://github.com/apache/nutch/pull/771

[GitHub] [nutch] tballison commented on pull request #771: NUTCH-2999 fix for initial PR

2023-08-30 Thread via GitHub
tballison commented on PR #771: URL: https://github.com/apache/nutch/pull/771#issuecomment-1699690920 Apologies for the noise!

[GitHub] [nutch] tballison opened a new pull request, #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-08-31 Thread via GitHub
tballison opened a new pull request, #772: URL: https://github.com/apache/nutch/pull/772

[GitHub] [nutch] tballison commented on pull request #772: NUTCH-2978 -- upgrade to log4j2 throughout

2023-09-14 Thread via GitHub
tballison commented on PR #772: URL: https://github.com/apache/nutch/pull/772#issuecomment-1720326084 Yes, of course. That'd be fantastic. Thank you!

[PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-01 Thread via GitHub
tballison opened a new pull request, #794: URL: https://github.com/apache/nutch/pull/794

Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-01 Thread via GitHub
lewismc commented on PR #794: URL: https://github.com/apache/nutch/pull/794#issuecomment-1789810071 We have no tests for `ParseSegment` right now. I think it would be excellent if this PR could include a test for `ParseSegment.isTruncated`.

Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-11-02 Thread via GitHub
lewismc merged PR #789: URL: https://github.com/apache/nutch/pull/789

Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-11-02 Thread via GitHub
lewismc commented on code in PR #789: URL: https://github.com/apache/nutch/pull/789#discussion_r138646 ## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ## @@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config) @Override

[PR] NUTCH-3024 Remove flaky 'dependency check' target [nutch]

2023-11-03 Thread via GitHub
lewismc opened a new pull request, #795: URL: https://github.com/apache/nutch/pull/795 Addresses https://issues.apache.org/jira/browse/NUTCH-3024

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub
jnioche commented on PR #796: URL: https://github.com/apache/nutch/pull/796#issuecomment-1798221743 Writing a test for this thing is an absolute pain. The way the filters are used for real is that their method setConf is called and the rules are loaded using _getConfResourceAsReader_, i.e.

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub
sebastian-nagel commented on code in PR #796: URL: https://github.com/apache/nutch/pull/796#discussion_r1384536930 ## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ## @@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {

Re: [PR] [NUTCH-3025] urlfilter-fast to filter based on the length of the URL [nutch]

2023-11-07 Thread via GitHub
jnioche commented on code in PR #796: URL: https://github.com/apache/nutch/pull/796#discussion_r1384621727 ## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ## @@ -97,9 +97,17 @@ public class FastURLFilter implements URLFilter {

Re: [PR] NUTCH-3014 Standardize Job names [nutch]

2023-10-29 Thread via GitHub
sebastian-nagel commented on code in PR #789: URL: https://github.com/apache/nutch/pull/789#discussion_r1375421979 ## src/java/org/apache/nutch/crawl/CrawlDbReader.java: ## @@ -812,7 +811,7 @@ public CrawlDatum get(String crawlDb, String url, Configuration config)

Re: [PR] [NUTCH-3017] Allow fast-urlfilter to load from HDFS/S3 [nutch]

2023-10-31 Thread via GitHub
sebastian-nagel commented on code in PR #793: URL: https://github.com/apache/nutch/pull/793#discussion_r1377375552 ## src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java: ## @@ -181,9 +186,23 @@ public String filter(String url) { public

Re: [PR] NUTCH-3015 Add more CI steps to GitHub master-build.yml [nutch]

2023-10-23 Thread via GitHub
lewismc commented on PR #790: URL: https://github.com/apache/nutch/pull/790#issuecomment-1775944455 I realize that this is a pretty HUGE pull request but I will qualify that by saying that absolutely no functionality has been changed here. The only changes are with the GitHub CI.

Re: [PR] NUTCH-3015 Add more CI steps to GitHub master-build.yml [nutch]

2023-10-23 Thread via GitHub
lewismc commented on PR #790: URL: https://github.com/apache/nutch/pull/790#issuecomment-1775942988 CI has stabilized and we now have passing builds for Ubuntu and macOS. Windows builds were failing so I just disabled them... I can add them back in though if we want to...? I also

Re: [PR] NUTCH-2887 Migrate to JUnit 5 Jupiter [nutch]

2023-10-24 Thread via GitHub
lewismc commented on PR #791: URL: https://github.com/apache/nutch/pull/791#issuecomment-1778548551 OK, I'm over the bulk of the work here. A brief synopsis of what has been done so far... * new junit5 jupiter dependencies added to `ivy.xml` * All assertions migrated to new

Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison commented on PR #797: URL: https://github.com/apache/nutch/pull/797#issuecomment-1794934171 Need to keep as draft until the 2.9.1.0 shim actually lands in maven central.

[PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison opened a new pull request, #797: URL: https://github.com/apache/nutch/pull/797

Re: [PR] NUTCH-3020 -- ParseSegment should check for okhttp's truncation flag [nutch]

2023-11-06 Thread via GitHub
tballison merged PR #794: URL: https://github.com/apache/nutch/pull/794

Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison merged PR #797: URL: https://github.com/apache/nutch/pull/797

Re: [PR] NUTCH-3019 -- update Tika to 2.9.1 [nutch]

2023-11-06 Thread via GitHub
tballison commented on PR #797: URL: https://github.com/apache/nutch/pull/797#issuecomment-1795161171 ```2023-11-06T15:02:47.9408964Z [junit] Tests run: 14, Failures: 2, Errors: 0, Skipped: 4, Time elapsed: 4.342 sec 2023-11-06T15:02:48.2192793Z [junit] Test

Re: [PR] NUTCH-3015 Add more CI steps to GitHub master-build.yml [nutch]

2023-10-27 Thread via GitHub
lewismc merged PR #790: URL: https://github.com/apache/nutch/pull/790

Re: [PR] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] [nutch]

2023-10-30 Thread via GitHub
jnioche commented on PR #792: URL: https://github.com/apache/nutch/pull/792#issuecomment-1785804884 Obviously, I pulled more changes than I meant to.

[PR] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] [nutch]

2023-10-30 Thread via GitHub
jnioche opened a new pull request, #792: URL: https://github.com/apache/nutch/pull/792 See description in https://issues.apache.org/jira/browse/NUTCH-3017

Re: [PR] Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] [nutch]

2023-10-30 Thread via GitHub
jnioche closed pull request #792: Allow fast-urlfilter to load from HDFS/S3 and support gzipped input [NUTCH-3017] URL: https://github.com/apache/nutch/pull/792

[GitHub] [nutch] lewismc commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-21 Thread via GitHub
lewismc commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1729671358 I suggest that we downgrade to Tika 2.2.1 to fix that regression.

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-18 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1724217959 I bumped some of the more common dependencies to match Tika 2.9.0. Let me know what you think.

[GitHub] [nutch] tballison opened a new pull request, #778: NUTCH-3004

2023-09-25 Thread via GitHub
tballison opened a new pull request, #778: URL: https://github.com/apache/nutch/pull/778

[GitHub] [nutch] lewismc commented on pull request #782: NUTCH-3009 Upgrade to Hadoop 3.3.6

2023-09-28 Thread via GitHub
lewismc commented on PR #782: URL: https://github.com/apache/nutch/pull/782#issuecomment-1739126057 I’ll check Javac output for any deprecation warnings.

[GitHub] [nutch] sebastian-nagel opened a new pull request, #782: NUTCH-3009 Upgrade to Hadoop 3.3.6

2023-09-28 Thread via GitHub
sebastian-nagel opened a new pull request, #782: URL: https://github.com/apache/nutch/pull/782 (no comment)

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-28 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-173655 Paging the `nutch-test-single-node-cluster` helpdesk: what do I use for the tika seeds file? Are you using our github repo, or the tika-parsers-common package

[GitHub] [nutch] lewismc commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-28 Thread via GitHub
lewismc commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1740040368 Try the full path. Also make sure the directory exists on HDFS.

[GitHub] [nutch] sebastian-nagel commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
sebastian-nagel commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1740397046 > what do I use for the tika seeds file? Are you using our github repo, or the > tika-parsers-common package specifically see the comments in

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741039619 With the update to Tika 2.9.1-SNAPSHOT, I get 85 failed parses; most of them are either encrypted documents or "can't retrieve Tika Parser for x"

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741040471 >see the comments in [test_tika_parser.sh](https://github.com/sebastian-nagel/nutch-test-single-node-cluster/blob/master/test_tika_parser.sh) Sorry! Yep, saw that too late.

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741061515 I reverted to 2.2.1, and that's not far enough back: there were 222 parse failures, many with the wrap problem. I reverted to 2.0.0, and then had 85 parse failures again. This

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741263080 Alright, the only thing that I think _might_ work is Tika shading commons-io in tika-app, and then Nutch uses tika-app instead of the individual parser-modules etc. for parser-tika.

[GitHub] [nutch] sebastian-nagel opened a new pull request, #785: NUTCH-2853 bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean

2023-09-30 Thread via GitHub
sebastian-nagel opened a new pull request, #785: URL: https://github.com/apache/nutch/pull/785 (no comment)

[GitHub] [nutch] sebastian-nagel opened a new pull request, #784: NUTCH-2897 Do not supress deprecated API warnings

2023-09-30 Thread via GitHub
sebastian-nagel opened a new pull request, #784: URL: https://github.com/apache/nutch/pull/784 - deprecate constructor of NutchJob - remove deprecated call to Object.finalize() from Plugin.finalize()

[GitHub] [nutch] sebastian-nagel opened a new pull request, #783: NUTCH-3010 Injector: count unique number of injected URLs

2023-09-30 Thread via GitHub
sebastian-nagel opened a new pull request, #783: URL: https://github.com/apache/nutch/pull/783 - add counter urls_injected_unique - improve log messages reporting the counts of injected/merged URLs

[GitHub] [nutch] sebastian-nagel merged pull request #780: NUTCH-2852 SpotBugs: Method invokes System.exit(...)

2023-09-30 Thread via GitHub
sebastian-nagel merged PR #780: URL: https://github.com/apache/nutch/pull/780

[GitHub] [nutch] sebastian-nagel merged pull request #781: NUTCH-3007 Fix impossible casts

2023-09-30 Thread via GitHub
sebastian-nagel merged PR #781: URL: https://github.com/apache/nutch/pull/781

[GitHub] [nutch] sebastian-nagel opened a new pull request, #786: NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-01 Thread via GitHub
sebastian-nagel opened a new pull request, #786: URL: https://github.com/apache/nutch/pull/786 (no comment)
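
The idea in the PR title (treat HTTP 429 like a 5xx server error when fetching robots.txt) can be illustrated with a minimal sketch. The class and method names below are hypothetical and are not taken from Nutch's `HttpRobotRulesParser`; what the caller does once a status is classified this way is left out on purpose.

```java
/** Hypothetical sketch; not Nutch's actual HttpRobotRulesParser code. */
public class RobotsStatusSketch {

  /**
   * Returns true if a robots.txt response status should take the
   * "server error" code path: any 5xx status, or 429 Too Many Requests.
   */
  public static boolean treatAsServerError(int status) {
    return status == 429 || (status >= 500 && status <= 599);
  }
}
```

Under this assumption, a 404 or 200 response still goes through the regular handling, and only 429 joins the 5xx group.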

Re: [PR] NUTCH-3010 Injector: count unique number of injected URLs [nutch]

2023-10-02 Thread via GitHub
sebastian-nagel merged PR #783: URL: https://github.com/apache/nutch/pull/783

Re: [PR] NUTCH-3010 Injector: count unique number of injected URLs [nutch]

2023-10-02 Thread via GitHub
sebastian-nagel commented on PR #783: URL: https://github.com/apache/nutch/pull/783#issuecomment-1742673559 > update the [Injector metrics documentation](https://cwiki.apache.org/confluence/display/NUTCH/Metrics#Metrics-Injector). Done. Thanks, @lewismc!

[GitHub] [nutch] sebastian-nagel opened a new pull request, #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub
sebastian-nagel opened a new pull request, #779: URL: https://github.com/apache/nutch/pull/779 - follow multiple redirects when fetching robots.txt - number of followed redirects is configurable by the property `http.robots.redirect.max` (default: 5) - improvements in
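
A bounded redirect loop of the kind described above might look roughly like the following sketch. It is not the code from the PR: the `Response` type, the `fetcher` function and the class name are hypothetical stand-ins, and the `maxRedirects` argument plays the role of the `http.robots.redirect.max` property (default: 5).

```java
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch; not Nutch's actual HttpRobotRulesParser code. */
public class RobotsRedirectSketch {

  /** Minimal stand-in for an HTTP response: status code plus Location header (may be null). */
  public static final class Response {
    public final int status;
    public final String location;

    public Response(int status, String location) {
      this.status = status;
      this.location = location;
    }
  }

  /** Status codes treated as redirects in this sketch. */
  static boolean isRedirect(int status) {
    return status == 301 || status == 302 || status == 303
        || status == 307 || status == 308;
  }

  /**
   * Fetches url and follows at most maxRedirects redirects.
   * Returns empty if the chain is longer than maxRedirects
   * or if a redirect response carries no Location.
   */
  public static Optional<Response> fetchFollowingRedirects(
      String url, int maxRedirects, Function<String, Response> fetcher) {
    String current = url;
    for (int followed = 0; ; followed++) {
      Response response = fetcher.apply(current);
      if (!isRedirect(response.status)) {
        return Optional.of(response); // final answer, redirect or not
      }
      if (followed == maxRedirects || response.location == null) {
        return Optional.empty(); // limit reached or broken redirect
      }
      current = response.location;
    }
  }
}
```

Passing the fetcher as a function keeps the sketch free of network code, so the limit logic can be exercised with an in-memory map of URLs to responses.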

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-26 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1735261857 Converting this to draft until Hadoop 3.4.0 is released. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL

[GitHub] [nutch] tballison merged pull request #778: NUTCH-3004

2023-09-26 Thread via GitHub
tballison merged PR #778: URL: https://github.com/apache/nutch/pull/778

[GitHub] [nutch] lewismc commented on pull request #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub
lewismc commented on PR #779: URL: https://github.com/apache/nutch/pull/779#issuecomment-1735761972 Very nice, @sebastian-nagel. Do you have an example on hand of a robots.txt which can be fetched with >1 redirects?

Re: [PR] NUTCH-2959 -- upgrade Tika to 2.9.0 [nutch]

2023-10-03 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1745032158 Local build of the shim works with 2.9.1-SNAPSHOT, which includes the latest version of POI which will conflict with commons-io in 3.4.0 if we don't use this shim (and/or if Hadoop doesn't

Re: [PR] NUTCH-2959 -- upgrade Tika to 2.9.0 [nutch]

2023-10-03 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1745177908 I published the shim artifacts for: https://github.com/tballison/hadoop-safe-tika It looks like they haven't made it into the main maven repositories yet. :( Once they do, I

Re: [PR] NUTCH-2853 bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean [nutch]

2023-10-03 Thread via GitHub
sebastian-nagel merged PR #785: URL: https://github.com/apache/nutch/pull/785

Re: [PR] NUTCH-2897 Do not supress deprecated API warnings [nutch]

2023-10-03 Thread via GitHub
sebastian-nagel merged PR #784: URL: https://github.com/apache/nutch/pull/784

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-19 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1725714860 I'm guessing that commit won't work if distributed hadoop is bringing its own jars (as you said!). Does hadoop do any custom classloading so that the job jars are isolated from the

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-19 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1725801990 > Btw., I've just rediscovered that using Tika in (pseudo)distributed mode is broken since the upgrade to Tika 2.3.0, see [NUTCH-2937](https://issues.apache.org/jira/browse/NUTCH-2937).

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-19 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1725604218 Weird, I just pushed a commit bumping commons-io on my NUTCH-2959 branch, and it isn't showing up in the PR... I'll wait a bit... Maybe github is out for coffee?

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-19 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1725611372 I haven't worked with ant in a while. According to `ant dependencytree`, it looks like we don't have to exclude commons-io everywhere -- placing it in the main ivy.xml has the same effect

[GitHub] [nutch] sebastian-nagel commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-19 Thread via GitHub
sebastian-nagel commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1725795918 > Can we exclude commons-io from hadoop and then add it as a dependency in the main ivy.xml? When running in distributed or pseudo-distributed mode, commons-io 2.8.0 is first

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-19 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1725746397 I'm getting a ConnectException when I try to run nutch-test-single-node-cluster. On hadoop startup, I see: ``` 2023-09-19 10:25:15,186 INFO util.GSet: VM type = 64-bit

[GitHub] [nutch] sebastian-nagel opened a new pull request, #777: NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-09-19 Thread via GitHub
sebastian-nagel opened a new pull request, #777: URL: https://github.com/apache/nutch/pull/777 - implement class CaseInsensitiveMetadata providing case-insensitive metadata look-ups (but no spell-checking) - use CaseInsensitiveMetadata to hold HTTP header metadata in the class

[GitHub] [nutch] sebastian-nagel commented on pull request #779: NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-09-26 Thread via GitHub
sebastian-nagel commented on PR #779: URL: https://github.com/apache/nutch/pull/779#issuecomment-1735968193 > an example on hand of a robots.txt which can be fetched with >1 redirects? http://wikipedia.org/robots.txt Note: works with protocol-http, for protocol-okhttp need

[GitHub] [nutch] sebastian-nagel commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-26 Thread via GitHub
sebastian-nagel commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1736008780 > I suggest that we downgrade to Tika 2.2.1 to fix that regression. Good point, @lewismc. I've opened NUTCH-3006 for that.

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-19 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1726191372 :sob: Y, let's hold off until Hadoop 3.4.0 is released. Thank you, again!

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741125973 There is just no winning... We just upgraded POI to 5.2.4, and it uses a bunch of the newer commons-io methods. If we downgrade POI to 5.2.3, we get a clean build of Tika with

[GitHub] [nutch] tballison commented on pull request #776: NUTCH-2959 -- upgrade Tika to 2.9.0

2023-09-29 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1741143346 Stepping away from the keyboard. :sob:

[GitHub] [nutch] sebastian-nagel opened a new pull request, #780: NUTCH-2852 SpotBugs: Method invokes System.exit(...)

2023-09-28 Thread via GitHub
sebastian-nagel opened a new pull request, #780: URL: https://github.com/apache/nutch/pull/780 Remove all calls of System.exit(...) in methods except main(args) of various "checker" tools and replace by return values passed to main().
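
The refactoring described in this pull request can be illustrated with a minimal sketch. It is not the code from the PR: the class name `CheckerSketch`, the method `run`, and the usage message are hypothetical and only show the general pattern of returning a status code to `main()` instead of exiting inside a worker method.

```java
public class CheckerSketch {

  /**
   * Hypothetical worker method: reports success or failure through its
   * return value instead of calling System.exit(...) itself.
   */
  public static int run(String[] args) {
    if (args.length < 1) {
      System.err.println("Usage: CheckerSketch <url>");
      return 1; // a System.exit(1) call would have been here before
    }
    System.out.println("checking " + args[0]);
    return 0;
  }

  public static void main(String[] args) {
    // The only place that terminates the JVM: the status code
    // returned by run() is passed on as the process exit code.
    System.exit(run(args));
  }
}
```

With this shape, `run` can be called from tests or from other tools without ending the JVM, which is presumably why SpotBugs flags `System.exit(...)` outside `main`.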

[GitHub] [nutch] sebastian-nagel opened a new pull request, #781: NUTCH-3007 Fix impossible casts

2023-09-28 Thread via GitHub
sebastian-nagel opened a new pull request, #781: URL: https://github.com/apache/nutch/pull/781 - remove code blocks (else clauses) unneeded and containing impossible casts

Re: [PR] NUTCH-2959 -- upgrade Tika to 2.9.0 [nutch]

2023-10-10 Thread via GitHub
sebastian-nagel commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1755378665 Hi @tballison, I've tested the shim artifact in local and pseudo-distributed mode: everything looks good. - no more exceptions about CloseShieldInputStream.wrap(...) during

[PR] NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents [nutch]

2023-10-09 Thread via GitHub
sebastian-nagel opened a new pull request, #787: URL: https://github.com/apache/nutch/pull/787 Use UTF-8 as fall-back encoding when stringifying the content of unparsed documents.
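
A fall-back of this kind might look like the sketch below. It is an assumption about the general idea, not the SegmentReader code: the class `RecodeSketch` and the helper `toText` are hypothetical. The helper decodes bytes with a named charset and falls back to UTF-8 when no charset is known, or the name is not supported, rather than failing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class RecodeSketch {

  /**
   * Hypothetical helper: turn raw content into a String. If charsetName
   * is null (e.g. nothing was detected because the document was never
   * parsed) or names an unknown charset, UTF-8 is used instead.
   */
  public static String toText(byte[] content, String charsetName) {
    Charset charset = StandardCharsets.UTF_8; // fall-back encoding
    if (charsetName != null) {
      try {
        charset = Charset.forName(charsetName);
      } catch (IllegalArgumentException e) {
        // illegal or unsupported charset name: keep the UTF-8 fall-back
      }
    }
    return new String(content, charset);
  }
}
```

The null check is what avoids the NullPointerException in this sketch; catching `IllegalArgumentException` covers both illegal and unsupported charset names.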

Re: [PR] NUTCH-2959 -- upgrade Tika to 2.9.0 [nutch]

2023-10-20 Thread via GitHub
tballison merged PR #776: URL: https://github.com/apache/nutch/pull/776

Re: [PR] NUTCH-2959 -- upgrade Tika to 2.9.0 [nutch]

2023-10-20 Thread via GitHub
tballison commented on PR #776: URL: https://github.com/apache/nutch/pull/776#issuecomment-1773207035 Thank you so much @sebastian-nagel !

Re: [PR] NUTCH-3002 Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive [nutch]

2023-10-21 Thread via GitHub
sebastian-nagel merged PR #777: URL: https://github.com/apache/nutch/pull/777

Re: [PR] NUTCH-3009 Upgrade to Hadoop 3.3.6 [nutch]

2023-10-21 Thread via GitHub
sebastian-nagel merged PR #782: URL: https://github.com/apache/nutch/pull/782

Re: [PR] NUTCH-2990 HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309 [nutch]

2023-10-21 Thread via GitHub
sebastian-nagel merged PR #779: URL: https://github.com/apache/nutch/pull/779

Re: [PR] NUTCH-3011 HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) [nutch]

2023-10-21 Thread via GitHub
sebastian-nagel merged PR #786: URL: https://github.com/apache/nutch/pull/786

Re: [PR] NUTCH-3012 SegmentReader when dumping with option -recode: NPE on unparsed documents [nutch]

2023-10-21 Thread via GitHub
sebastian-nagel merged PR #787: URL: https://github.com/apache/nutch/pull/787
