[
https://issues.apache.org/jira/browse/NUTCH-3043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3043.
Resolution: Implemented
> Generator: count URLs rejected by URL filt
[
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3039.
Resolution: Fixed
> Failure to handle ftp:// U
Sebastian Nagel created NUTCH-3055:
--
Summary: README: fix Github "hub" commands
Key: NUTCH-3055
URL: https://issues.apache.org/jira/browse/NUTCH-3055
Project: Nutch
Issue
[
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842291#comment-17842291
]
Sebastian Nagel commented on NUTCH-3028:
+1 lgtm.
One question: if there is no parseData
[
https://issues.apache.org/jira/browse/NUTCH-3045?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17842284#comment-17842284
]
Sebastian Nagel commented on NUTCH-3045:
See also NUTCH-2987. Until HADOOP-17177 / HADOOP-18887
Hi Lewis,
> The Jenkins job used to be run nightly but
> no longer is.
It pulls nightly from git:
https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/scmPollLog/
but a build is only run if there are new commits. The latest one:
Sebastian Nagel created NUTCH-3044:
--
Summary: Generator: NPE when extracting the host part of a URL
fails
Key: NUTCH-3044
URL: https://issues.apache.org/jira/browse/NUTCH-3044
Project: Nutch
Sebastian Nagel created NUTCH-3043:
--
Summary: Generator: count URLs rejected by URL filters
Key: NUTCH-3043
URL: https://issues.apache.org/jira/browse/NUTCH-3043
Project: Nutch
Issue Type
Sebastian Nagel created NUTCH-3040:
--
Summary: Upgrade to Hadoop 3.4.0
Key: NUTCH-3040
URL: https://issues.apache.org/jira/browse/NUTCH-3040
Project: Nutch
Issue Type: Improvement
https://github.com/sebastian-nagel/nutch-test-single-node-cluster/
One note about the CHANGES.md: it's now a mixture of HTML and plain text.
It does not use the potential of markdown, e.g. sections / headlines for
the releases to make the change log navigable via a table of contents.
The embedded
[
https://issues.apache.org/jira/browse/NUTCH-3039?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reassigned NUTCH-3039:
--
Assignee: Sebastian Nagel
> Failure to handle ftp:// U
Sebastian Nagel created NUTCH-3039:
--
Summary: Failure to handle ftp:// URLs
Key: NUTCH-3039
URL: https://issues.apache.org/jira/browse/NUTCH-3039
Project: Nutch
Issue Type: Bug
[
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2937.
Resolution: Fixed
Fixed NUTCH-2959 by using the shaded Tika package. Thanks, [~tallison
[
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reassigned NUTCH-2937:
--
Assignee: Tim Allison
> parse-tika: review dependency exclusions and avoid depende
[
https://issues.apache.org/jira/browse/NUTCH-2937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2937:
---
Fix Version/s: 1.20
(was: 1.21)
> parse-tika: review depende
[
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3005.
Resolution: Implemented
Done by [~lewismc] as part of NUTCH-3036, commit
[1563396|https
[
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3016.
Resolution: Duplicate
> Upgrade Apache Ivy to 2.
[
https://issues.apache.org/jira/browse/NUTCH-3016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3016:
---
Fix Version/s: 1.20
(was: 1.21)
> Upgrade Apache Ivy to 2.
[
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3005:
---
Affects Version/s: 1.19
> Upgrade selenium as nee
[
https://issues.apache.org/jira/browse/NUTCH-3005?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3005:
---
Fix Version/s: 1.20
> Upgrade selenium as nee
[
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3028:
---
Affects Version/s: 1.19
> WARCExported to support filtering by J
[
https://issues.apache.org/jira/browse/NUTCH-3028?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3028:
---
Fix Version/s: 1.21
> WARCExported to support filtering by J
[
https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2960.
Resolution: Won't Fix
The license issue is addressed by NUTCH-3008.
> indexer-elas
[
https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel closed NUTCH-2960.
--
> indexer-elastic: remove plugin from binary package to address licensing iss
[
https://issues.apache.org/jira/browse/NUTCH-2960?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2960:
---
Fix Version/s: (was: 1.20)
> indexer-elastic: remove plugin from binary pack
[
https://issues.apache.org/jira/browse/NUTCH-3008?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3008.
Resolution: Fixed
> indexer-elastic: downgrade to ES 7.10.2 to address licensing iss
[
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3029.
Resolution: Implemented
> Host specific max. and min. intervals in adaptive schedu
[
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel closed NUTCH-3029.
--
> Host specific max. and min. intervals in adaptive schedu
[
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reopened NUTCH-3029:
Assignee: Sebastian Nagel (was: Markus Jelsma)
Reopen to update "Fix version(s)"
[
https://issues.apache.org/jira/browse/NUTCH-3029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3029:
---
Fix Version/s: 1.20
> Host specific max. and min. intervals in adaptive schedu
Sebastian Nagel created NUTCH-3035:
--
Summary: Update license and notice file for release of 1.20
Key: NUTCH-3035
URL: https://issues.apache.org/jira/browse/NUTCH-3035
Project: Nutch
Issue
Hi Lewis,
yes, of course!
Some points we should do before the release:
- address the ES licensing issue,
the easiest way is to downgrade, see NUTCH-3008
If done update the license-related files.
- there are three short PRs open
I'll try to have a look at these points in the next few days.
[
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3025.
Resolution: Implemented
> urlfilter-fast to filter based on the length of the
[
https://issues.apache.org/jira/browse/NUTCH-3025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3025:
---
Component/s: plugin
urlfilter
> urlfilter-fast to filter ba
[
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17784030#comment-17784030
]
Sebastian Nagel commented on NUTCH-3017:
Thanks, [~jnioche]
> Allow fast-urlfilter to load f
[
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3017.
Resolution: Implemented
> Allow fast-urlfilter to load from HDFS/S3 and support gzip
[
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3017:
---
Component/s: plugin
urlfilter
> Allow fast-urlfilter to load from HDFS
[
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3017:
---
Fix Version/s: 1.20
> Allow fast-urlfilter to load from HDFS/S3 and support gzipped in
Hi Lewis,
>> whether we need a Nutch custom code style at all… why don’t we just use
>> some other existing style and then enforce it?
Enforcing: yes!
However, I would try hard to keep the changes to a reasonable minimum. For
example, if we change the indentation, almost every code line is
[
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3012.
Resolution: Fixed
> SegmentReader when dumping with option -recode: NPE on unpar
[
https://issues.apache.org/jira/browse/NUTCH-3011?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3011.
Resolution: Implemented
> HttpRobotRulesParser: handle HTTP 429 Too Many Requests s
[
https://issues.apache.org/jira/browse/NUTCH-2990?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2990.
Resolution: Implemented
Thanks, everybody!
> HttpRobotRulesParser to follow 5 redire
[
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reassigned NUTCH-3009:
--
Assignee: Sebastian Nagel
> Upgrade to Hadoop 3.
[
https://issues.apache.org/jira/browse/NUTCH-3009?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3009.
Resolution: Implemented
> Upgrade to Hadoop 3.
[
https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3006.
Fix Version/s: (was: 1.20)
Resolution: Abandoned
> Downgrade Tika depende
[
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reassigned NUTCH-3002:
--
Assignee: Sebastian Nagel
> Protocol-okhttp HttpResponse: HTTP header metadata loo
[
https://issues.apache.org/jira/browse/NUTCH-3002?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3002.
Resolution: Fixed
> Protocol-okhttp HttpResponse: HTTP header metadata lookup sho
[
https://issues.apache.org/jira/browse/NUTCH-3014?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17778103#comment-17778103
]
Sebastian Nagel commented on NUTCH-3014:
If there is a single data name/directory (CrawlDb
[
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3012:
---
Description:
SegmentReader when called with the flag {{-recode}} fails with a NPE when
[
https://issues.apache.org/jira/browse/NUTCH-3012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3012:
---
Summary: SegmentReader when dumping with option -recode: NPE on unparsed
documents
Sebastian Nagel created NUTCH-3012:
--
Summary: SegmentReader when dumping with option -recode: NPE on
documents without charset defined
Key: NUTCH-3012
URL: https://issues.apache.org/jira/browse/NUTCH-3012
[
https://issues.apache.org/jira/browse/NUTCH-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17771445#comment-17771445
]
Sebastian Nagel commented on NUTCH-2959:
Hi [~tallison], it's your decision whether the time
[
https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-1130.
Resolution: Won't Do
Closing - the any23 project has retired and the any23 plugin
[
https://issues.apache.org/jira/browse/NUTCH-1130?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel closed NUTCH-1130.
--
> JUnit test for Any23 RDF plugin
> ---
>
>
[
https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2938.
Resolution: Won't Do
Closing - the any23 project has retired and the any23 plugin
[
https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel closed NUTCH-2938.
--
> Use Any23's RepositoryWriter to write structured data to Rdf4j reposit
[
https://issues.apache.org/jira/browse/NUTCH-2938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2938:
---
Fix Version/s: (was: 1.20)
> Use Any23's RepositoryWriter to write structured d
[
https://issues.apache.org/jira/browse/NUTCH-2853?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2853.
Resolution: Fixed
> bin/nutch: remove deprecated commands solrindex, solrdedup, solrcl
[
https://issues.apache.org/jira/browse/NUTCH-2897?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2897.
Resolution: Fixed
> Do not supress deprecated API warni
[
https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3010.
Resolution: Fixed
> Injector: count unique number of injected U
Sebastian Nagel created NUTCH-3011:
--
Summary: HttpRobotRulesParser: handle HTTP 429 Too Many Requests
same as server errors (HTTP 5xx)
Key: NUTCH-3011
URL: https://issues.apache.org/jira/browse/NUTCH-3011
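The idea in this summary can be illustrated with a minimal sketch. This is not the Nutch implementation (Nutch is written in Java); the function name and the returned labels are hypothetical and chosen only for illustration. It assumes the usual robots.txt convention that other 4xx responses mean "no restrictions", while server errors mean the crawler should hold back, and it puts HTTP 429 into the same bucket as HTTP 5xx.

```python
def classify_robots_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a coarse action.

    The labels are hypothetical and only illustrate the grouping.
    """
    if 200 <= status < 300:
        return "parse"       # robots.txt fetched: parse its rules
    if status == 429 or 500 <= status < 600:
        return "defer"       # 429 handled the same way as server errors
    if 400 <= status < 500:
        return "allow_all"   # other client errors: no robots.txt restrictions
    return "other"           # e.g. redirects, handled elsewhere


for code in (200, 404, 429, 503):
    print(code, classify_robots_status(code))
```

The check for 429 comes before the generic 4xx branch, so it never falls through to "allow_all". What the crawler does with a "defer" result (retry later, skip the host for a while) is a separate policy decision that this sketch does not cover.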
[
https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel closed NUTCH-1373.
--
> Implement consistent execution of normalising and filtering in Genera
[
https://issues.apache.org/jira/browse/NUTCH-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-1373.
Resolution: Abandoned
Closing as Nutch 2.x (aka. nutchgora) isn't maintained anymore
[
https://issues.apache.org/jira/browse/NUTCH-1374?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770833#comment-17770833
]
Sebastian Nagel commented on NUTCH-1374:
The package.html files were replaced by package
[
https://issues.apache.org/jira/browse/NUTCH-1635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770831#comment-17770831
]
Sebastian Nagel commented on NUTCH-1635:
Hi [~markus17], did this continue to happen in the last
[
https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-1947.
Resolution: Abandoned
Closing because OutlinkExtractor has seen many updates since
[
https://issues.apache.org/jira/browse/NUTCH-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel closed NUTCH-1947.
--
> Overhaul o.a.n.parse.OutlinkExtractor.j
[
https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2053.
Resolution: Abandoned
Closing this old issue (8 years), assuming that dependencies have
[
https://issues.apache.org/jira/browse/NUTCH-2053?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel closed NUTCH-2053.
--
> Uncessary dependencies included in ivy.xml (post NUTCH-2
[
https://issues.apache.org/jira/browse/NUTCH-2423?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2423.
Fix Version/s: (was: 1.20)
Resolution: Fixed
The wiki pages were updated
[
https://issues.apache.org/jira/browse/NUTCH-2820?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2820.
Resolution: Resolved
Resolved with the removal of the any23 plugin (NUTCH-2998).
> Rev
[
https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2888.
Resolution: Duplicate
Thanks, [~mmkivist]! This issue was resolved by NUTCH-2980
[
https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2888:
---
Affects Version/s: 1.18
> Selenium Protocol: Support for Seleniu
[
https://issues.apache.org/jira/browse/NUTCH-2888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-2888:
---
Fix Version/s: 1.20
> Selenium Protocol: Support for Seleniu
[
https://issues.apache.org/jira/browse/NUTCH-3007?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-3007.
Resolution: Fixed
Thanks for the review, [~markus17]!
> Fix impossible ca
[
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2852.
Resolution: Fixed
> Method invokes System.exit(...) 9 b
[
https://issues.apache.org/jira/browse/NUTCH-3010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3010:
---
Description:
Injector uses two counters: one for the total number of injected URLs
Sebastian Nagel created NUTCH-3010:
--
Summary: Injector: count unique number of injected URLs
Key: NUTCH-3010
URL: https://issues.apache.org/jira/browse/NUTCH-3010
Project: Nutch
Issue Type
On Thu, Sep 28, 2023 at 9:29 AM Tim Allison <talli...@apache.org> wrote:
Y, I'd like to get a working Tika version in a release fairly soon. Not sure
how much effort a release is?
On Thu, Sep 28, 2023 at 8:29 AM Sebastian Nagel <sna...@apache.org> wrote
[
https://issues.apache.org/jira/browse/NUTCH-3006?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770320#comment-17770320
]
Sebastian Nagel commented on NUTCH-3006:
> revert CloseShieldInputStream.wrap(), which I th
[
https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17770041#comment-17770041
]
Sebastian Nagel commented on NUTCH-2979:
Note: upgrading to Hadoop 3.3.6 (NUTCH-3009) will update
Sebastian Nagel created NUTCH-3009:
--
Summary: Upgrade to Hadoop 3.3.6
Key: NUTCH-3009
URL: https://issues.apache.org/jira/browse/NUTCH-3009
Project: Nutch
Issue Type: Improvement
[
https://issues.apache.org/jira/browse/NUTCH-2979?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel resolved NUTCH-2979.
Resolution: Fixed
Resolved, so far, without any direct action:
- Nutch core still depends
Hi Lewis,
thanks!
I'd put on top of the list
* release 1.20
Since the release of 1.19 more than one year has elapsed.
Otherwise I agree with all points on the road map, even
in this order / priority.
Best,
Sebastian
On 9/26/23 18:37, lewis john mcgibbney wrote:
Hi dev@,
I've been at
Sebastian Nagel created NUTCH-3008:
--
Summary: indexer-elastic: downgrade to ES 7.10.2 to address
licensing issues
Key: NUTCH-3008
URL: https://issues.apache.org/jira/browse/NUTCH-3008
Project: Nutch
Sebastian Nagel created NUTCH-3007:
--
Summary: Fix impossible casts
Key: NUTCH-3007
URL: https://issues.apache.org/jira/browse/NUTCH-3007
Project: Nutch
Issue Type: Sub-task
Affects
[
https://issues.apache.org/jira/browse/NUTCH-2852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17769977#comment-17769977
]
Sebastian Nagel commented on NUTCH-2852:
The PR addresses all corresponding issues in the checker
Sebastian Nagel created NUTCH-3006:
--
Summary: Downgrade Tika dependency to 2.2.1 (core and parse-tika)
Key: NUTCH-3006
URL: https://issues.apache.org/jira/browse/NUTCH-3006
Project: Nutch
[
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3004:
---
Fix Version/s: 1.20
> Avoid NPE in HttpRespo
[
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3004:
---
Component/s: plugin
protocol
> Avoid NPE in HttpRespo
[
https://issues.apache.org/jira/browse/NUTCH-3004?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-3004:
---
Affects Version/s: 1.19
> Avoid NPE in HttpRespo
[
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-585:
--
Priority: Major (was: Minor)
> [PARSE-HTML plugin] Block certain parts of HTML code from be
[
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-585:
--
Component/s: parse-filter
HTML
parser
plugin
[
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel reassigned NUTCH-585:
-
Assignee: Sebastian Nagel (was: Markus Jelsma)
> [PARSE-HTML plugin] Block cert
[
https://issues.apache.org/jira/browse/NUTCH-585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sebastian Nagel updated NUTCH-585:
--
Fix Version/s: 1.20
> [PARSE-HTML plugin] Block certain parts of HTML code from being inde
Sebastian Nagel created NUTCH-3002:
--
Summary: Protocol-okhttp HttpResponse: HTTP header metadata lookup
should be case-insensitive
Key: NUTCH-3002
URL: https://issues.apache.org/jira/browse/NUTCH-3002
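What "case-insensitive lookup" means for HTTP header metadata can be shown with a small sketch. It is not the protocol-okhttp code; the class and method names are hypothetical. It relies only on the fact that HTTP header field names are case-insensitive, so "Content-Type" and "content-type" should resolve to the same entry.

```python
from typing import Dict, List, Optional


class CaseInsensitiveHeaders:
    """Hypothetical header store that ignores the case of header names."""

    def __init__(self) -> None:
        # Keys are lower-cased header names; values keep insertion order.
        self._values: Dict[str, List[str]] = {}

    def add(self, name: str, value: str) -> None:
        self._values.setdefault(name.lower(), []).append(value)

    def get(self, name: str) -> Optional[str]:
        """Return the first value stored for the header, or None if absent."""
        values = self._values.get(name.lower())
        return values[0] if values else None


headers = CaseInsensitiveHeaders()
headers.add("Content-Type", "text/html")
print(headers.get("content-type"))
```

Normalizing the name once on write and once on read keeps the lookup a plain dictionary access. A real implementation might also want to preserve the original spelling of the header name for output, which this sketch does not do.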
+1
Since any23 also depends on tika-core, the plugin is likely to break if we
upgrade to a more recent Tika version in Nutch core and the parse-tika plugin.
~Sebastian
On 9/13/23 16:50, Tim Allison wrote:
All,
I opened https://issues.apache.org/jira/browse/NUTCH-2998
[
https://issues.apache.org/jira/browse/NUTCH-3000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764692#comment-17764692
]
Sebastian Nagel commented on NUTCH-3000:
+1 Yes, the full HTML seems the best choice
[
https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672
]
Sebastian Nagel edited comment on NUTCH-2998 at 9/13/23 1:26 PM:
-
+1
[
https://issues.apache.org/jira/browse/NUTCH-2998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17764672#comment-17764672
]
Sebastian Nagel commented on NUTCH-2998:
+1
> Remove the Any23 plu