[jira] [Resolved] (NUTCH-2812) Methods returning array may expose internal representation

2024-09-17 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2812?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2812. Resolution: Fixed > Methods returning array may expose internal representation > --

[jira] [Resolved] (NUTCH-1942) Remove TopLevelDomain

2024-09-17 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1942. Resolution: Done > Remove TopLevelDomain > -- > > Key:

[jira] [Resolved] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-09-17 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1806. Resolution: Implemented Thanks, everybody! > Delegate processing of URL domains to crawler

[jira] [Resolved] (NUTCH-3058) Fetcher: counter for hung threads

2024-09-16 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3058. Resolution: Implemented > Fetcher: counter for hung threads > -

[jira] [Commented] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-09-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881792#comment-17881792 ] Sebastian Nagel commented on NUTCH-3059: Note: the above test was run in pseudo-d

[jira] [Commented] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-09-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17881791#comment-17881791 ] Sebastian Nagel commented on NUTCH-3059: Ok, found the reason: it's because of [

[jira] [Resolved] (NUTCH-3061) URL filters to log name of the rule file rules are read from

2024-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3061?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3061. Resolution: Implemented > URL filters to log name of the rule file rules are read from > --

[jira] [Resolved] (NUTCH-3062) protocol-okhttp: optionally record HTTP and SSL/TLS versions

2024-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3062. Resolution: Implemented > protocol-okhttp: optionally record HTTP and SSL/TLS versions > --

[jira] [Resolved] (NUTCH-3065) Format changelog as Markdown

2024-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3065. Resolution: Implemented > Format changelog as Markdown > > >

[jira] [Resolved] (NUTCH-3066) Protocol plugin unit tests fail randomly

2024-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3066. Resolution: Fixed > Protocol plugin unit tests fail randomly >

[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-09-11 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880958#comment-17880958 ] Sebastian Nagel commented on NUTCH-1806: > it seems odd to return an empty String

[jira] [Created] (NUTCH-3067) Improve performance of FetchItemQueues if error state is preserved

2024-09-07 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3067: -- Summary: Improve performance of FetchItemQueues if error state is preserved Key: NUTCH-3067 URL: https://issues.apache.org/jira/browse/NUTCH-3067 Project: Nutch

[jira] [Commented] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2024-09-07 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17880036#comment-17880036 ] Sebastian Nagel commented on NUTCH-1806: Any comments on this? It's an important

[jira] [Resolved] (NUTCH-3063) Support for "addBinaryContent" from REST API

2024-09-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3063. Resolution: Implemented Committed in [ac03cf1|https://github.com/apache/nutch/commit/ac03c

[jira] [Commented] (NUTCH-3063) Support for "addBinaryContent" from REST API

2024-09-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879964#comment-17879964 ] Sebastian Nagel commented on NUTCH-3063: +1 looks good. And definitely makes sens

[jira] [Commented] (NUTCH-3065) Format changelog as Markdown

2024-09-05 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17879666#comment-17879666 ] Sebastian Nagel commented on NUTCH-3065: PR in progress: the [reformatted change

[jira] [Assigned] (NUTCH-3065) Format changelog as Markdown

2024-09-05 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3065?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3065: -- Assignee: Sebastian Nagel > Format changelog as Markdown > ---

[jira] [Created] (NUTCH-3065) Format changelog as Markdown

2024-09-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3065: -- Summary: Format changelog as Markdown Key: NUTCH-3065 URL: https://issues.apache.org/jira/browse/NUTCH-3065 Project: Nutch Issue Type: Improvement

[jira] [Updated] (NUTCH-3060) Javadoc link broken on website

2024-08-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3060: --- Description: The link to the 1.20 Javadocs on [https://nutch.apache.org/documentation/javadoc

[jira] [Commented] (NUTCH-3060) Javadoc link broken on website

2024-08-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17870291#comment-17870291 ] Sebastian Nagel commented on NUTCH-3060: The missing Javadocs are now placed on s

[jira] [Created] (NUTCH-3062) protocol-okhttp: optionally record HTTP and SSL/TLS versions

2024-07-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3062: -- Summary: protocol-okhttp: optionally record HTTP and SSL/TLS versions Key: NUTCH-3062 URL: https://issues.apache.org/jira/browse/NUTCH-3062 Project: Nutch

[jira] [Created] (NUTCH-3061) URL filters to log name of the rule file rules are read from

2024-07-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3061: -- Summary: URL filters to log name of the rule file rules are read from Key: NUTCH-3061 URL: https://issues.apache.org/jira/browse/NUTCH-3061 Project: Nutch

[jira] [Created] (NUTCH-3060) Javadoc link broken on website

2024-06-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3060: -- Summary: Javadoc link broken on website Key: NUTCH-3060 URL: https://issues.apache.org/jira/browse/NUTCH-3060 Project: Nutch Issue Type: Bug Co

[jira] [Updated] (NUTCH-3060) Javadoc link broken on website

2024-06-28 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3060: --- Fix Version/s: 1.21 (was: 1.20) > Javadoc link broken on website > ---

[jira] [Created] (NUTCH-3059) Generator: selector job does not count reduce output records

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3059: -- Summary: Generator: selector job does not count reduce output records Key: NUTCH-3059 URL: https://issues.apache.org/jira/browse/NUTCH-3059 Project: Nutch

[jira] [Created] (NUTCH-3058) Fetcher: counter for hung threads

2024-06-05 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3058: -- Summary: Fetcher: counter for hung threads Key: NUTCH-3058 URL: https://issues.apache.org/jira/browse/NUTCH-3058 Project: Nutch Issue Type: Improvement

[jira] [Resolved] (NUTCH-3055) README: fix Github "hub" commands

2024-05-28 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3055?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3055. Resolution: Fixed > README: fix Github "hub" commands > - >

[jira] [Resolved] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-05-28 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3044?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3044. Resolution: Fixed > Generator: NPE when extracting the host part of a URL fails > -

[jira] [Resolved] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3043. Resolution: Implemented > Generator: count URLs rejected by URL filters > -

[jira] [Resolved] (NUTCH-3039) Failure to handle ftp:// URLs

2024-05-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3039. Resolution: Fixed > Failure to handle ftp:// URLs > - > >

[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3055: -- Summary: README: fix Github "hub" commands Key: NUTCH-3055 URL: https://issues.apache.org/jira/browse/NUTCH-3055 Project: Nutch Issue Type: Bug

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842291#comment-17842291 ] Sebastian Nagel commented on NUTCH-3028: +1 lgtm. One question: if there is no p

[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842284#comment-17842284 ] Sebastian Nagel commented on NUTCH-3045: See also NUTCH-2987. Until HADOOP-17177

[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3044: -- Summary: Generator: NPE when extracting the host part of a URL fails Key: NUTCH-3044 URL: https://issues.apache.org/jira/browse/NUTCH-3044 Project: Nutch

[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3043: -- Summary: Generator: count URLs rejected by URL filters Key: NUTCH-3043 URL: https://issues.apache.org/jira/browse/NUTCH-3043 Project: Nutch Issue Type: I

[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3040: -- Summary: Upgrade to Hadoop 3.4.0 Key: NUTCH-3040 URL: https://issues.apache.org/jira/browse/NUTCH-3040 Project: Nutch Issue Type: Improvement C

[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3039: -- Assignee: Sebastian Nagel > Failure to handle ftp:// URLs > --

[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3039: -- Summary: Failure to handle ftp:// URLs Key: NUTCH-3039 URL: https://issues.apache.org/jira/browse/NUTCH-3039 Project: Nutch Issue Type: Bug Com

[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2937. Resolution: Fixed Fixed NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison]!

[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2937: -- Assignee: Tim Allison > parse-tika: review dependency exclusions and avoid dependency

[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2937: --- Fix Version/s: 1.20 (was: 1.21) > parse-tika: review dependency exclus

[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3005. Resolution: Implemented Done by [~lewismc] as part of NUTCH-3036, commit [1563396|https://

[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3016. Resolution: Duplicate > Upgrade Apache Ivy to 2.5.2 > --- > >

[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3016: --- Fix Version/s: 1.20 (was: 1.21) > Upgrade Apache Ivy to 2.5.2 > --

[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Affects Version/s: 1.19 > Upgrade selenium as needed > -- > >

[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Fix Version/s: 1.20 > Upgrade selenium as needed > -- > >

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Affects Version/s: 1.19 > WARCExported to support filtering by JEXL > ---

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Fix Version/s: 1.21 > WARCExported to support filtering by JEXL > ---

[jira] [Resolved] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2960. Resolution: Won't Fix The license issue is addressed by NUTCH-3008. > indexer-elastic: rem

[jira] [Closed] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2960. -- > indexer-elastic: remove plugin from binary package to address licensing issues >

[jira] [Updated] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2960: --- Fix Version/s: (was: 1.20) > indexer-elastic: remove plugin from binary package to addres

[jira] [Resolved] (NUTCH-3008) indexer-elastic: downgrade to ES 7.10.2 to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3008. Resolution: Fixed > indexer-elastic: downgrade to ES 7.10.2 to address licensing issues > -

[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3029. Resolution: Implemented > Host specific max. and min. intervals in adaptive scheduler > ---

[jira] [Closed] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-3029. -- > Host specific max. and min. intervals in adaptive scheduler > ---

[jira] [Reopened] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-3029: Assignee: Sebastian Nagel (was: Markus Jelsma) Reopen to update "Fix version(s)" - add 1

[jira] [Updated] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3029: --- Fix Version/s: 1.20 > Host specific max. and min. intervals in adaptive scheduler > -

[jira] [Created] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-13 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3035: -- Summary: Update license and notice file for release of 1.20 Key: NUTCH-3035 URL: https://issues.apache.org/jira/browse/NUTCH-3035 Project: Nutch Issue T

[jira] [Resolved] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3025. Resolution: Implemented > urlfilter-fast to filter based on the length of the URL > ---

[jira] [Updated] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3025: --- Component/s: plugin urlfilter > urlfilter-fast to filter based on the length

[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17784030#comment-17784030 ] Sebastian Nagel commented on NUTCH-3017: Thanks, [~jnioche] > Allow fast-urlfilt

[jira] [Resolved] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3017. Resolution: Implemented > Allow fast-urlfilter to load from HDFS/S3 and support gzipped inp

[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3017: --- Component/s: plugin urlfilter > Allow fast-urlfilter to load from HDFS/S3 an

[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3017: --- Fix Version/s: 1.20 > Allow fast-urlfilter to load from HDFS/S3 and support gzipped input > -

[jira] [Resolved] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3012. Resolution: Fixed > SegmentReader when dumping with option -recode: NPE on unparsed documen

[jira] [Resolved] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3011. Resolution: Implemented > HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as s

[jira] [Resolved] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2990. Resolution: Implemented Thanks, everybody! > HttpRobotRulesParser to follow 5 redirects as

[jira] [Assigned] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3009: -- Assignee: Sebastian Nagel > Upgrade to Hadoop 3.3.6 > --- > >

[jira] [Resolved] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3009. Resolution: Implemented > Upgrade to Hadoop 3.3.6 > --- > >

[jira] [Resolved] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3006. Fix Version/s: (was: 1.20) Resolution: Abandoned > Downgrade Tika dependency to

[jira] [Assigned] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3002: -- Assignee: Sebastian Nagel > Protocol-okhttp HttpResponse: HTTP header metadata lookup

[jira] [Resolved] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3002. Resolution: Fixed > Protocol-okhttp HttpResponse: HTTP header metadata lookup should be >

[jira] [Commented] (NUTCH-3014) Standardize NutchJob job names

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17778103#comment-17778103 ] Sebastian Nagel commented on NUTCH-3014: If there is a single data name/directory

[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3012: --- Description: SegmentReader when called with the flag {{-recode}} fails with a NPE when tryin

[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3012: --- Summary: SegmentReader when dumping with option -recode: NPE on unparsed documents (was: Seg

[jira] [Created] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on documents without charset defined

2023-10-08 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3012: -- Summary: SegmentReader when dumping with option -recode: NPE on documents without charset defined Key: NUTCH-3012 URL: https://issues.apache.org/jira/browse/NUTCH-3012

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17771445#comment-17771445 ] Sebastian Nagel commented on NUTCH-2959: Hi [~tallison], it's your decision wheth

[jira] [Resolved] (NUTCH-1130) JUnit test for Any23 RDF plugin

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1130. Resolution: Won't Do Closing - the any23 project has retired and the any23 plugin was remov

[jira] [Closed] (NUTCH-1130) JUnit test for Any23 RDF plugin

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1130. -- > JUnit test for Any23 RDF plugin > --- > > Key: NUTCH-

[jira] [Resolved] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2938. Resolution: Won't Do Closing - the any23 project has retired and the any23 plugin was remov

[jira] [Closed] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2938. -- > Use Any23's RepositoryWriter to write structured data to Rdf4j repository > -

[jira] [Updated] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2938: --- Fix Version/s: (was: 1.20) > Use Any23's RepositoryWriter to write structured data to Rdf

[jira] [Resolved] (NUTCH-2853) bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2853. Resolution: Fixed > bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean >

[jira] [Resolved] (NUTCH-2897) Do not supress deprecated API warnings

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2897. Resolution: Fixed > Do not supress deprecated API warnings > --

[jira] [Resolved] (NUTCH-3010) Injector: count unique number of injected URLs

2023-10-02 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3010. Resolution: Fixed > Injector: count unique number of injected URLs > --

[jira] [Created] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-01 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3011: -- Summary: HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) Key: NUTCH-3011 URL: https://issues.apache.org/jira/browse/NUTCH-3011

[jira] [Closed] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1373. -- > Implement consistent execution of normalising and filtering in Generator > --

[jira] [Resolved] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1373. Resolution: Abandoned Closing as Nutch 2.x (aka. nutchgora) isn't maintained anymore. > Im

[jira] [Commented] (NUTCH-1374) Workaround for license headers

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770833#comment-17770833 ] Sebastian Nagel commented on NUTCH-1374: The package.html files were replaced by

[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17770831#comment-17770831 ] Sebastian Nagel commented on NUTCH-1635: Hi [~markus17], did this continue to hap

[jira] [Resolved] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1947. Resolution: Abandoned Closing because OutlinkExtractor has seen many updates since then: up

[jira] [Closed] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1947. -- > Overhaul o.a.n.parse.OutlinkExtractor.java > --- > >

[jira] [Resolved] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2053. Resolution: Abandoned Closing this old issue (8 years), assuming that dependencies have bee

[jira] [Closed] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2053. -- > Uncessary dependencies included in ivy.xml (post NUTCH-2038) > --

[jira] [Resolved] (NUTCH-2423) Update contributor info page

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2423. Fix Version/s: (was: 1.20) Resolution: Fixed The wiki pages were updated in 2020

[jira] [Resolved] (NUTCH-2820) Review sample files used in any23 unit tests

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2820. Resolution: Resolved Resolved with the removal of the any23 plugin (NUTCH-2998). > Review

[jira] [Resolved] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2888. Resolution: Duplicate Thanks, [~mmkivist]! This issue was resolved by NUTCH-2980 and will b

[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2888: --- Affects Version/s: 1.18 > Selenium Protocol: Support for Selenium 4 > ---

[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2888: --- Fix Version/s: 1.20 > Selenium Protocol: Support for Selenium 4 > ---

[jira] [Resolved] (NUTCH-3007) Fix impossible casts

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3007. Resolution: Fixed Thanks for the review, [~markus17]! > Fix impossible casts > ---

[jira] [Resolved] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2852. Resolution: Fixed > Method invokes System.exit(...) 9 bugs > --
