[jira] [Resolved] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-05-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3043. Resolution: Implemented > Generator: count URLs rejected by URL filt

[jira] [Resolved] (NUTCH-3039) Failure to handle ftp:// URLs

2024-05-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3039. Resolution: Fixed > Failure to handle ftp:// U

[jira] [Created] (NUTCH-3055) README: fix Github "hub" commands

2024-04-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3055: -- Summary: README: fix Github "hub" commands Key: NUTCH-3055 URL: https://issues.apache.org/jira/browse/NUTCH-3055 Project: Nutch Issue

[jira] [Commented] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291 ] Sebastian Nagel commented on NUTCH-3028: +1 lgtm. One question: if there is no parseData

[jira] [Commented] (NUTCH-3045) Upgrade from Java 11 to 17

2024-04-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284 ] Sebastian Nagel commented on NUTCH-3045: See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887

Re: [DISCUSS] Consolidating Nutch Continuous Integration

2024-04-28 Thread Sebastian Nagel
Hi Lewis, > The Jenkins job used to be run nightly but > no longer is. It pulls nightly from git: https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/scmPollLog/ but a build is only run if there are new commits. The latest one:

[jira] [Created] (NUTCH-3044) Generator: NPE when extracting the host part of a URL fails

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3044: -- Summary: Generator: NPE when extracting the host part of a URL fails Key: NUTCH-3044 URL: https://issues.apache.org/jira/browse/NUTCH-3044 Project: Nutch

[jira] [Created] (NUTCH-3043) Generator: count URLs rejected by URL filters

2024-04-25 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3043: -- Summary: Generator: count URLs rejected by URL filters Key: NUTCH-3043 URL: https://issues.apache.org/jira/browse/NUTCH-3043 Project: Nutch Issue Type

[jira] [Created] (NUTCH-3040) Upgrade to Hadoop 3.4.0

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3040: -- Summary: Upgrade to Hadoop 3.4.0 Key: NUTCH-3040 URL: https://issues.apache.org/jira/browse/NUTCH-3040 Project: Nutch Issue Type: Improvement

Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Sebastian Nagel
https://github.com/sebastian-nagel/nutch-test-single-node-cluster/ One note about the CHANGES.md: it's now a mixture of HTML and plain text. It does not use the potential of markdown, e.g. sections / headlines for the releases to make the change log navigable via a table of contents. The embedded

[jira] [Assigned] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3039: -- Assignee: Sebastian Nagel > Failure to handle ftp:// U

[jira] [Created] (NUTCH-3039) Failure to handle ftp:// URLs

2024-04-11 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3039: -- Summary: Failure to handle ftp:// URLs Key: NUTCH-3039 URL: https://issues.apache.org/jira/browse/NUTCH-3039 Project: Nutch Issue Type: Bug

[jira] [Resolved] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2937. Resolution: Fixed Fixed NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison

[jira] [Assigned] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-2937: -- Assignee: Tim Allison > parse-tika: review dependency exclusions and avoid depende

[jira] [Updated] (NUTCH-2937) parse-tika: review dependency exclusions and avoid dependency conflicts in distributed mode

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2937: --- Fix Version/s: 1.20 (was: 1.21) > parse-tika: review depende

[jira] [Resolved] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3005. Resolution: Implemented Done by [~lewismc] as part of NUTCH-3036, commit [1563396|https

[jira] [Resolved] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3016. Resolution: Duplicate > Upgrade Apache Ivy to 2.

[jira] [Updated] (NUTCH-3016) Upgrade Apache Ivy to 2.5.2

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3016: --- Fix Version/s: 1.20 (was: 1.21) > Upgrade Apache Ivy to 2.

[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Affects Version/s: 1.19 > Upgrade selenium as nee

[jira] [Updated] (NUTCH-3005) Upgrade selenium as needed

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3005: --- Fix Version/s: 1.20 > Upgrade selenium as nee

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Affects Version/s: 1.19 > WARCExported to support filtering by J

[jira] [Updated] (NUTCH-3028) WARCExported to support filtering by JEXL

2024-04-06 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3028: --- Fix Version/s: 1.21 > WARCExported to support filtering by J

[jira] [Resolved] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2960. Resolution: Won't Fix The license issue is addressed by NUTCH-3008. > indexer-elas

[jira] [Closed] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2960. -- > indexer-elastic: remove plugin from binary package to address licensing iss

[jira] [Updated] (NUTCH-2960) indexer-elastic: remove plugin from binary package to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2960: --- Fix Version/s: (was: 1.20) > indexer-elastic: remove plugin from binary pack

[jira] [Resolved] (NUTCH-3008) indexer-elastic: downgrade to ES 7.10.2 to address licensing issues

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3008. Resolution: Fixed > indexer-elastic: downgrade to ES 7.10.2 to address licensing iss

[jira] [Resolved] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3029. Resolution: Implemented > Host specific max. and min. intervals in adaptive schedu

[jira] [Closed] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-3029. -- > Host specific max. and min. intervals in adaptive schedu

[jira] [Reopened] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reopened NUTCH-3029: Assignee: Sebastian Nagel (was: Markus Jelsma) Reopen to update "Fix version(s)"

[jira] [Updated] (NUTCH-3029) Host specific max. and min. intervals in adaptive scheduler

2024-03-14 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3029: --- Fix Version/s: 1.20 > Host specific max. and min. intervals in adaptive schedu

[jira] [Created] (NUTCH-3035) Update license and notice file for release of 1.20

2024-03-13 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3035: -- Summary: Update license and notice file for release of 1.20 Key: NUTCH-3035 URL: https://issues.apache.org/jira/browse/NUTCH-3035 Project: Nutch Issue

Re: [DISCUSS] Release Nutch 1.20

2024-03-09 Thread Sebastian Nagel
Hi Lewis, yes, of course! Some points we should do before the release: - address the ES licensing issue, the easiest way is to downgrade, see NUTCH-3008 If done update the license-related files. - there are three short PRs open I'll try to have a look at these points the next days.

[jira] [Resolved] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3025. Resolution: Implemented > urlfilter-fast to filter based on the length of the

[jira] [Updated] (NUTCH-3025) urlfilter-fast to filter based on the length of the URL

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3025: --- Component/s: plugin urlfilter > urlfilter-fast to filter ba

[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784030#comment-17784030 ] Sebastian Nagel commented on NUTCH-3017: Thanks, [~jnioche] > Allow fast-urlfilter to load f

[jira] [Resolved] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-11-08 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3017. Resolution: Implemented > Allow fast-urlfilter to load from HDFS/S3 and support gzip

[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3017: --- Component/s: plugin urlfilter > Allow fast-urlfilter to load from HDFS

[jira] [Updated] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

2023-10-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3017: --- Fix Version/s: 1.20 > Allow fast-urlfilter to load from HDFS/S3 and support gzipped in

Re: Nutch codebase formatting

2023-10-29 Thread Sebastian Nagel
Hi Lewis, >> whether we need a Nutch custom code style at all… why don’t we just use >> some other existing style and then enforce it? Enforcing: yes! However, I would try hard to keep the changes on a reasonable minimum. For example, if we change the indentation, almost every code line is

[jira] [Resolved] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3012. Resolution: Fixed > SegmentReader when dumping with option -recode: NPE on unpar

[jira] [Resolved] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3011. Resolution: Implemented > HttpRobotRulesParser: handle HTTP 429 Too Many Requests s

[jira] [Resolved] (NUTCH-2990) HttpRobotRulesParser to follow 5 redirects as specified by RFC 9309

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2990. Resolution: Implemented Thanks, everybody! > HttpRobotRulesParser to follow 5 redire

[jira] [Assigned] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3009: -- Assignee: Sebastian Nagel > Upgrade to Hadoop 3.

[jira] [Resolved] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3009. Resolution: Implemented > Upgrade to Hadoop 3.

[jira] [Resolved] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3006. Fix Version/s: (was: 1.20) Resolution: Abandoned > Downgrade Tika depende

[jira] [Assigned] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-3002: -- Assignee: Sebastian Nagel > Protocol-okhttp HttpResponse: HTTP header metadata loo

[jira] [Resolved] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3002. Resolution: Fixed > Protocol-okhttp HttpResponse: HTTP header metadata lookup sho

[jira] [Commented] (NUTCH-3014) Standardize NutchJob job names

2023-10-21 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778103#comment-17778103 ] Sebastian Nagel commented on NUTCH-3014: If there is a single data name/directory (CrawlDb

[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3012: --- Description: SegmentReader when called with the flag {{-recode}} fails with a NPE when

[jira] [Updated] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on unparsed documents

2023-10-09 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3012: --- Summary: SegmentReader when dumping with option -recode: NPE on unparsed documents

[jira] [Created] (NUTCH-3012) SegmentReader when dumping with option -recode: NPE on documents without charset defined

2023-10-09 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3012: -- Summary: SegmentReader when dumping with option -recode: NPE on documents without charset defined Key: NUTCH-3012 URL: https://issues.apache.org/jira/browse/NUTCH-3012

[jira] [Commented] (NUTCH-2959) Upgrade to Apache Tika 2.9.0

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771445#comment-17771445 ] Sebastian Nagel commented on NUTCH-2959: Hi [~tallison], it's your decision whether the time

[jira] [Resolved] (NUTCH-1130) JUnit test for Any23 RDF plugin

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1130. Resolution: Won't Do Closing - the any23 project has retired and the any23 plugin

[jira] [Closed] (NUTCH-1130) JUnit test for Any23 RDF plugin

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1130. -- > JUnit test for Any23 RDF plugin > --- > >

[jira] [Resolved] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2938. Resolution: Won't Do Closing - the any23 project has retired and the any23 plugin

[jira] [Closed] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2938. -- > Use Any23's RepositoryWriter to write structured data to Rdf4j reposit

[jira] [Updated] (NUTCH-2938) Use Any23's RepositoryWriter to write structured data to Rdf4j repository

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2938: --- Fix Version/s: (was: 1.20) > Use Any23's RepositoryWriter to write structured d

[jira] [Resolved] (NUTCH-2853) bin/nutch: remove deprecated commands solrindex, solrdedup, solrclean

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2853. Resolution: Fixed > bin/nutch: remove deprecated commands solrindex, solrdedup, solrcl

[jira] [Resolved] (NUTCH-2897) Do not supress deprecated API warnings

2023-10-03 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2897. Resolution: Fixed > Do not supress deprecated API warni

[jira] [Resolved] (NUTCH-3010) Injector: count unique number of injected URLs

2023-10-02 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3010. Resolution: Fixed > Injector: count unique number of injected U

[jira] [Created] (NUTCH-3011) HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx)

2023-10-01 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3011: -- Summary: HttpRobotRulesParser: handle HTTP 429 Too Many Requests same as server errors (HTTP 5xx) Key: NUTCH-3011 URL: https://issues.apache.org/jira/browse/NUTCH-3011

[jira] [Closed] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1373. -- > Implement consistent execution of normalising and filtering in Genera

[jira] [Resolved] (NUTCH-1373) Implement consistent execution of normalising and filtering in Generator

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1373. Resolution: Abandoned Closing as Nutch 2.x (aka. nutchgora) isn't maintained anymore

[jira] [Commented] (NUTCH-1374) Workaround for license headers

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770833#comment-17770833 ] Sebastian Nagel commented on NUTCH-1374: The package.html files were replaced by package

[jira] [Commented] (NUTCH-1635) New crawldb sometimes ends up in current

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770831#comment-17770831 ] Sebastian Nagel commented on NUTCH-1635: Hi [~markus17], did this continue to happen in the last

[jira] [Resolved] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-1947. Resolution: Abandoned Closing because OutlinkExtractor has seen many updates since

[jira] [Closed] (NUTCH-1947) Overhaul o.a.n.parse.OutlinkExtractor.java

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-1947. -- > Overhaul o.a.n.parse.OutlinkExtractor.j

[jira] [Resolved] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2053. Resolution: Abandoned Closing this old issue (8 years), assuming that dependencies have

[jira] [Closed] (NUTCH-2053) Uncessary dependencies included in ivy.xml (post NUTCH-2038)

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel closed NUTCH-2053. -- > Uncessary dependencies included in ivy.xml (post NUTCH-2

[jira] [Resolved] (NUTCH-2423) Update contributor info page

2023-10-01 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2423. Fix Version/s: (was: 1.20) Resolution: Fixed The wiki pages were updated

[jira] [Resolved] (NUTCH-2820) Review sample files used in any23 unit tests

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2820. Resolution: Resolved Resolved with the removal of the any23 plugin (NUTCH-2998). > Rev

[jira] [Resolved] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2888. Resolution: Duplicate Thanks, [~mmkivist]! This issue was resolved by NUTCH-2980

[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2888: --- Affects Version/s: 1.18 > Selenium Protocol: Support for Seleniu

[jira] [Updated] (NUTCH-2888) Selenium Protocol: Support for Selenium 4

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2888: --- Fix Version/s: 1.20 > Selenium Protocol: Support for Seleniu

[jira] [Resolved] (NUTCH-3007) Fix impossible casts

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-3007. Resolution: Fixed Thanks for the review, [~markus17]! > Fix impossible ca

[jira] [Resolved] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2852. Resolution: Fixed > Method invokes System.exit(...) 9 b

[jira] [Updated] (NUTCH-3010) Injector: count unique number of injected URLs

2023-09-30 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3010: --- Description: Injector uses two counters: one for the total number of injected URLs

[jira] [Created] (NUTCH-3010) Injector: count unique number of injected URLs

2023-09-30 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3010: -- Summary: Injector: count unique number of injected URLs Key: NUTCH-3010 URL: https://issues.apache.org/jira/browse/NUTCH-3010 Project: Nutch Issue Type

Re: Establishing a Nutch development roadmap

2023-09-29 Thread Sebastian Nagel
On Thu, Sep 28, 2023 at 9:29 AM Tim Allison <talli...@apache.org> wrote: Y, I'd like to get a working Tika version in a release fairly soon. Not sure how much effort a release is? On Thu, Sep 28, 2023 at 8:29 AM Sebastian Nagel <sna...@apache.org> wrote

[jira] [Commented] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-29 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770320#comment-17770320 ] Sebastian Nagel commented on NUTCH-3006: > revert CloseShieldInputStream.wrap(), which I th

[jira] [Commented] (NUTCH-2979) Upgrade Commons Text to 1.10.0

2023-09-28 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770041#comment-17770041 ] Sebastian Nagel commented on NUTCH-2979: Note: upgrading to Hadoop 3.3.6 (NUTCH-3009) will update

[jira] [Created] (NUTCH-3009) Upgrade to Hadoop 3.3.6

2023-09-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3009: -- Summary: Upgrade to Hadoop 3.3.6 Key: NUTCH-3009 URL: https://issues.apache.org/jira/browse/NUTCH-3009 Project: Nutch Issue Type: Improvement

[jira] [Resolved] (NUTCH-2979) Upgrade Commons Text to 1.10.0

2023-09-28 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel resolved NUTCH-2979. Resolution: Fixed Resolved, so far, without any direct action: - Nutch core still depends

Re: Establishing a Nutch development roadmap

2023-09-28 Thread Sebastian Nagel
Hi Lewis, thanks! I'd put on top of the list * release 1.20 Since the release of 1.19 more than one year has elapsed. Otherwise I agree with all points on the road map, even in this order / priority. Best, Sebastian On 9/26/23 18:37, lewis john mcgibbney wrote: Hi dev@, I've been at

[jira] [Created] (NUTCH-3008) indexer-elastic: downgrade to ES 7.10.2 to address licensing issues

2023-09-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3008: -- Summary: indexer-elastic: downgrade to ES 7.10.2 to address licensing issues Key: NUTCH-3008 URL: https://issues.apache.org/jira/browse/NUTCH-3008 Project: Nutch

[jira] [Created] (NUTCH-3007) Fix impossible casts

2023-09-28 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3007: -- Summary: Fix impossible casts Key: NUTCH-3007 URL: https://issues.apache.org/jira/browse/NUTCH-3007 Project: Nutch Issue Type: Sub-task Affects

[jira] [Commented] (NUTCH-2852) Method invokes System.exit(...) 9 bugs

2023-09-28 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769977#comment-17769977 ] Sebastian Nagel commented on NUTCH-2852: The PR addresses all corresponding issues in the checker

[jira] [Created] (NUTCH-3006) Downgrade Tika dependency to 2.2.1 (core and parse-tika)

2023-09-26 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3006: -- Summary: Downgrade Tika dependency to 2.2.1 (core and parse-tika) Key: NUTCH-3006 URL: https://issues.apache.org/jira/browse/NUTCH-3006 Project: Nutch

[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Fix Version/s: 1.20 > Avoid NPE in HttpRespo

[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Component/s: plugin protocol > Avoid NPE in HttpRespo

[jira] [Updated] (NUTCH-3004) Avoid NPE in HttpResponse

2023-09-26 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-3004: --- Affects Version/s: 1.19 > Avoid NPE in HttpRespo

[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-585: -- Priority: Major (was: Minor) > [PARSE-HTML plugin] Block certain parts of HTML code from be

[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-585: -- Component/s: parse-filter HTML parser plugin

[jira] [Assigned] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel reassigned NUTCH-585: - Assignee: Sebastian Nagel (was: Markus Jelsma) > [PARSE-HTML plugin] Block cert

[jira] [Updated] (NUTCH-585) [PARSE-HTML plugin] Block certain parts of HTML code from being indexed

2023-09-23 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-585: -- Fix Version/s: 1.20 > [PARSE-HTML plugin] Block certain parts of HTML code from being inde

[jira] [Created] (NUTCH-3002) Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive

2023-09-18 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-3002: -- Summary: Protocol-okhttp HttpResponse: HTTP header metadata lookup should be case-insensitive Key: NUTCH-3002 URL: https://issues.apache.org/jira/browse/NUTCH-3002

Re: [DISCUSS] Removing Any23 from Nutch?

2023-09-13 Thread Sebastian Nagel
+1 Since any23 also depends on tika-core, the plugin is likely to break if we upgrade to a more recent Tika version in Nutch core and the parse-tika plugin. ~Sebastian On 9/13/23 16:50, Tim Allison wrote: All, I opened https://issues.apache.org/jira/browse/NUTCH-2998

[jira] [Commented] (NUTCH-3000) protocol-selenium returns only the body,strips off the element

2023-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764692#comment-17764692 ] Sebastian Nagel commented on NUTCH-3000: +1 Yes, the full HTML seems the best choice

[jira] [Comment Edited] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672 ] Sebastian Nagel edited comment on NUTCH-2998 at 9/13/23 1:26 PM: - +1

[jira] [Commented] (NUTCH-2998) Remove the Any23 plugin

2023-09-13 Thread Sebastian Nagel (Jira)
[ https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672 ] Sebastian Nagel commented on NUTCH-2998: +1 > Remove the Any23 plu
