[ANNOUNCE] Apache Nutch 1.20 Release
The Apache Nutch project is pleased to announce the Apache Nutch 1.20 release, available for download at https://nutch.apache.org/download/. Please verify signatures using the KEYS file https://raw.githubusercontent.com/apache/nutch/master/KEYS when downloading the release. This release includes more than 60 bug fixes and improvements; the full list of changes can be seen in the Jira release report https://s.apache.org/ovjf3 Thanks to everyone who contributed to this release! -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
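For readers new to verifying ASF releases, the workflow the announcement refers to is: import the KEYS file into GPG, verify the detached .asc signature, and check the SHA-512 checksum. A minimal sketch — the artifact names are examples, and the gpg steps are shown as comments because they require the actual downloads:

```shell
# Signature check against the real downloads (shown as comments):
#   gpg --import KEYS
#   gpg --verify apache-nutch-1.20-src.tar.gz.asc apache-nutch-1.20-src.tar.gz

# The SHA-512 check works as follows, demonstrated here on a stand-in file:
echo "release artifact" > artifact.tar.gz
sha512sum artifact.tar.gz > artifact.tar.gz.sha512
sha512sum -c artifact.tar.gz.sha512
```

On success, `sha512sum -c` reports the file name followed by `OK`.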
[RESULT] WAS Re: [VOTE] Apache Nutch 1.20 Release
Hi user@ & dev@, I’m glad to conclude the Nutch 1.20 release candidate VOTE thread with the following results: [5] +1 Release this package as Apache Nutch 1.20 — snagel*, balakuntala*, blackice*, Joe Gilvary, lewismc* [ ] -1 Do not release this package because… *Nutch Project Management Committee-binding The Nutch 1.20 release candidate has passed the community VOTE. I will therefore promote this release candidate. Thanks for voting and to everyone who contributed to the Apache Nutch 1.20 release. lewismc On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.20 release is available at [0] where > accompanying SHA512 and ASC signatures can also be found. > Information on verifying releases can be found at [1]. > > The release candidate comprises a .zip and tar.gz archive of the sources > at [2] and complementary binary distributions. In addition, a staged maven > repository is available at [3]. > > The Nutch 1.20 release report is available at [4]. > > Please vote on releasing this package as Apache Nutch 1.20. The vote is > open for at least the next 72 hours and passes if a majority of at least > three +1 Nutch PMC votes are cast. > > [ ] +1 Release this package as Apache Nutch 1.20. > > [ ] -1 Do not release this package because… > > Cheers, > lewismc > P.S. Here is my +1. > > [0] https://dist.apache.org/repos/dist/dev/nutch/1.20 > [1] http://nutch.apache.org/downloads.html#verify > [2] https://github.com/apache/nutch/tree/release-1.20 > [3] > https://repository.apache.org/content/repositories/orgapachenutch-1021/ > [4] https://s.apache.org/ovjf3 > > -- > http://home.apache.org/~lewismc/ > http://people.apache.org/keys/committer/lewismc > -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
Re: [VOTE] Apache Nutch 1.20 Release
Hi user@, dev@, Please consider reviewing the Nutch 1.20 release candidate. This is a critical prerequisite for making software releases at the ASF. Thank you, lewismc On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney wrote: > Hi Folks, > > A first candidate for the Nutch 1.20 release is available at [0] where > accompanying SHA512 and ASC signatures can also be found. > Information on verifying releases can be found at [1]. > > The release candidate comprises a .zip and tar.gz archive of the sources > at [2] and complementary binary distributions. In addition, a staged maven > repository is available at [3]. > > The Nutch 1.20 release report is available at [4]. > > Please vote on releasing this package as Apache Nutch 1.20. The vote is > open for at least the next 72 hours and passes if a majority of at least > three +1 Nutch PMC votes are cast. > > [ ] +1 Release this package as Apache Nutch 1.20. > > [ ] -1 Do not release this package because… > > Cheers, > lewismc > P.S. Here is my +1. > > [0] https://dist.apache.org/repos/dist/dev/nutch/1.20 > [1] http://nutch.apache.org/downloads.html#verify > [2] https://github.com/apache/nutch/tree/release-1.20 > [3] > https://repository.apache.org/content/repositories/orgapachenutch-1021/ > [4] https://s.apache.org/ovjf3 > > -- > http://home.apache.org/~lewismc/ > http://people.apache.org/keys/committer/lewismc > -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
Re: [VOTE] Apache Nutch 1.20 Release
Hi Lewis, here's my +1 * signatures of release packages are valid * build from the source package successful, unit tests pass * tested a few Nutch tools in the binary package (local mode) * ran a sample crawl and tested many Nutch tools on a single-node cluster running Hadoop 3.4.0, see https://github.com/sebastian-nagel/nutch-test-single-node-cluster/ One note about the CHANGES.md: it's now a mixture of HTML and plain text. It does not use the potential of Markdown, e.g. sections / headlines for the releases to make the change log navigable via a table of contents. The embedded HTML makes it less readable if viewed in a text editor. The rendering on GitHub is acceptable with only minor glitches, mostly the placement of multiple lines in a single paragraph: https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md We also have a change log on Jira: https://s.apache.org/ovjf3 That's why I wouldn't call the CHANGES.md a "blocker". We should update the formatting after the release to make it easily readable again in source form and improve the document structure using Markdown markup. ~Sebastian On 4/9/24 23:28, lewis john mcgibbney wrote: Hi Folks, A first candidate for the Nutch 1.20 release is available at [0] where accompanying SHA512 and ASC signatures can also be found. Information on verifying releases can be found at [1]. The release candidate comprises a .zip and tar.gz archive of the sources at [2] and complementary binary distributions. In addition, a staged maven repository is available at [3]. The Nutch 1.20 release report is available at [4]. Please vote on releasing this package as Apache Nutch 1.20. The vote is open for at least the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.20. [ ] -1 Do not release this package because… Cheers, lewismc P.S. Here is my +1. 
[0] https://dist.apache.org/repos/dist/dev/nutch/1.20 [1] http://nutch.apache.org/downloads.html#verify [2] https://github.com/apache/nutch/tree/release-1.20 [3] https://repository.apache.org/content/repositories/orgapachenutch-1021/ [4] https://s.apache.org/ovjf3 -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
[VOTE] Apache Nutch 1.20 Release
Hi Folks, A first candidate for the Nutch 1.20 release is available at [0] where accompanying SHA512 and ASC signatures can also be found. Information on verifying releases can be found at [1]. The release candidate comprises a .zip and tar.gz archive of the sources at [2] and complementary binary distributions. In addition, a staged maven repository is available at [3]. The Nutch 1.20 release report is available at [4]. Please vote on releasing this package as Apache Nutch 1.20. The vote is open for at least the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast. [ ] +1 Release this package as Apache Nutch 1.20. [ ] -1 Do not release this package because… Cheers, lewismc P.S. Here is my +1. [0] https://dist.apache.org/repos/dist/dev/nutch/1.20 [1] http://nutch.apache.org/downloads.html#verify [2] https://github.com/apache/nutch/tree/release-1.20 [3] https://repository.apache.org/content/repositories/orgapachenutch-1021/ [4] https://s.apache.org/ovjf3 -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
[GSoC 2024 PROPOSAL] Overhaul the legacy Nutch plugin framework and replace it with PF4J
Hi user@ & dev@, I decided to write up a GSoC’24 proposal and encourage interested applicants to register their interest in the JIRA issue or else reach out to the Nutch PMC over on d...@nutch.apache.org (please CC lewi...@apache.org). Title: Overhaul the legacy Nutch plugin framework and replace it with PF4J JIRA: https://issues.apache.org/jira/browse/NUTCH-3034 Thanks in advance, and good luck to prospective GSoC applicants. lewismc -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
Re: nutch adds %20 in urls instead of spaces
Thanks for the response, Markus. Disabling urlnormalizer-basic works. On Tue, Jan 9, 2024 at 3:43 PM Markus Jelsma wrote: > Hello Steve, > > Having those spaces normalized/encoded is expected behaviour with > urlnormalizer-basic active. I would recommend keeping it this way and having > all URLs in Solr properly encoded. Having spaces in Solr IDs is also not > recommended as it can lead to unexpected behaviour. > > If you really don't want them encoded, disable urlnormalizer-basic in your > configuration. > > Regards, > Markus > > On Tue, Jan 9, 2024 at 7:20 PM Steve Cohen wrote: > > Hello, > > > > I am updating a Nutch crawl that reads files in directories that have > > spaces. The urls show %20 instead of spaces. This doesn't seem to be what > > the behavior was in the past. > > > > In Nutch 1.10 I get these results > > > > Nutch 1.10 > > > > > > > > ParseData:: > > Version: 5 > > Status: success(1,0) > > Title: Index of /nycor/10-15-2018 and on - Scanned > > Outlinks: 4 > > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: > > 2018/ > > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: > > 2019/ > > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: > > 2022/ > > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date > > Unknown/ anchor: Shipment Date Unknown/ > > > > in Nutch 1.19, I get this > > > > > > ParseData:: > > Version: 5 > > Status: success(1,0) > > Title: Index of /nycor/10-15-2018 and on - Scanned > > Outlinks: 4 > > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ > > anchor: 2018/ > > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ > > anchor: 2019/ > > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ > > anchor: 2022/ > > outlink: toUrl: > > > file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ > > anchor: Shipment Date Unknown/ > > > > We are uploading to solr and the links aren't right with the %20s in the > > url. How do I remove the %20s? > > > > Thanks, > > Steve Cohen > > >
Re: nutch adds %20 in urls instead of spaces
Hello Steve, Having those spaces normalized/encoded is expected behaviour with urlnormalizer-basic active. I would recommend keeping it this way and having all URLs in Solr properly encoded. Having spaces in Solr IDs is also not recommended as it can lead to unexpected behaviour. If you really don't want them encoded, disable urlnormalizer-basic in your configuration. Regards, Markus On Tue, Jan 9, 2024 at 7:20 PM Steve Cohen wrote: > Hello, > > I am updating a Nutch crawl that reads files in directories that have > spaces. The urls show %20 instead of spaces. This doesn't seem to be what > the behavior was in the past. > > In Nutch 1.10 I get these results > > Nutch 1.10 > > > > ParseData:: > Version: 5 > Status: success(1,0) > Title: Index of /nycor/10-15-2018 and on - Scanned > Outlinks: 4 > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: > 2018/ > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: > 2019/ > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: > 2022/ > outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date > Unknown/ anchor: Shipment Date Unknown/ > > in Nutch 1.19, I get this > > > ParseData:: > Version: 5 > Status: success(1,0) > Title: Index of /nycor/10-15-2018 and on - Scanned > Outlinks: 4 > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ > anchor: 2018/ > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ > anchor: 2019/ > outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ > anchor: 2022/ > outlink: toUrl: > file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ > anchor: Shipment Date Unknown/ > > We are uploading to solr and the links aren't right with the %20s in the > url. How do I remove the %20s? > > Thanks, > Steve Cohen >
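For reference, disabling urlnormalizer-basic as Markus suggests is done by removing it from the plugin.includes property, typically overridden in conf/nutch-site.xml. A sketch — the plugin list below is illustrative; copy the actual default value from your nutch-default.xml and drop only "basic" from the urlnormalizer alternation:

```xml
<!-- conf/nutch-site.xml: plugin.includes override. "basic" has been removed
     from the urlnormalizer-(...) alternation so urlnormalizer-basic no
     longer loads. The rest of the plugin list is an example only. -->
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex)</value>
</property>
```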
nutch adds %20 in urls instead of spaces
Hello, I am updating a Nutch crawl that reads files in directories that have spaces. The urls show %20 instead of spaces. This doesn't seem to be what the behavior was in the past. In Nutch 1.10 I get these results Nutch 1.10 ParseData:: Version: 5 Status: success(1,0) Title: Index of /nycor/10-15-2018 and on - Scanned Outlinks: 4 outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor: 2018/ outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor: 2019/ outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor: 2022/ outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date Unknown/ anchor: Shipment Date Unknown/ in Nutch 1.19, I get this ParseData:: Version: 5 Status: success(1,0) Title: Index of /nycor/10-15-2018 and on - Scanned Outlinks: 4 outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/ anchor: 2018/ outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/ anchor: 2019/ outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/ anchor: 2022/ outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/ anchor: Shipment Date Unknown/ We are uploading to solr and the links aren't right with the %20s in the url. How do I remove the %20s? Thanks, Steve Cohen
Re: Nutch - Restriction by content type
Hello, You can skip certain types of documents based on their file extension, using the urlfilter-suffix plugin. It only filters known suffixes. Filtering based on content type is not possible at crawl time, because determining the content type requires fetching and parsing the documents first. You can, however, skip specific content types when indexing, using the Jexl indexing filter. Regards, Markus On Thu, Nov 16, 2023 at 2:56 PM Raj Chidara wrote: > Hello > Can we control crawling of web pages by its content type through any > configuration setting? For example, I want to crawl only pages whose > content type is text/html from a website and do not want to crawl other > pages/files. > > Thanks and Regards > > Raj Chidara >
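As a concrete illustration of skipping documents before fetch: besides the urlfilter-suffix plugin Markus mentions, the same effect is often achieved with the urlfilter-regex plugin, whose conf/regex-urlfilter.txt applies the first matching rule to each URL. A sketch with an illustrative extension list — as noted above, this works on the URL suffix only and cannot filter on the actual Content-Type header:

```
# conf/regex-urlfilter.txt (urlfilter-regex plugin): first matching rule wins.
# Skip URLs ending in common non-HTML extensions (list is illustrative):
-\.(gif|jpg|jpeg|png|ico|css|js|pdf|zip|gz|mp3|mp4)$
# Accept everything else:
+.
```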
Nutch - Restriction by content type
Hello, Can we control crawling of web pages by its content type through any configuration setting? For example, I want to crawl only pages whose content type is text/html from a website and do not want to crawl other pages/files. Thanks and Regards Raj Chidara Worldwide Offices: USA | UK | India | Singapore | Japan *ISO 9001, 27001, 2 Compliant www.DDIsmart.com DISCLAIMER: This message is intended solely for the use of the individual or entity to which it is addressed. If you are not the intended recipient, you should not use, copy, alter, or disclose the contents of this message. All information or opinions expressed in this message and/or any attachments are those of the author and are not necessarily those of the group companies.
Re: [DISCUSS] Removing Any23 from Nutch?
+1 Tim. On Wed, Sep 13, 2023 at 16:50, Tim Allison wrote: > -- Forwarded message -- > From: Tim Allison > To: user@nutch.apache.org, d...@nutch.apache.org > Date: Wed, 13 Sep 2023 10:50:08 -0400 > Subject: [DISCUSS] Removing Any23 from Nutch? > All, > I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks > ago. Any23 was moved to the attic in June. Unless there are objections, I > propose removing it from Nutch before the next release. > Any objections? > > Best, > > Tim >
[DISCUSS] Removing Any23 from Nutch?
All, I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks ago. Any23 was moved to the attic in June. Unless there are objections, I propose removing it from Nutch before the next release. Any objections? Best, Tim
Re: Nutch Exception
Hello, Please check the logs for more information. Regards, Markus On Mon, Jul 24, 2023 at 7:05 PM Raj Chidara wrote: > Hi > > Nutch 1.19 compiled with ant without any errors, but when running the > Injector I am getting this error: > > > > 19:20:25.055 [main] ERROR org.apache.nutch.crawl.Injector - Injector job > did not succeed, job id: job_local952809651_0001, job status: FAILED, > reason: NA > > Exception in thread "main" java.lang.RuntimeException: Injector job did > not succeed, job id: job_local952809651_0001, job status: FAILED, reason: NA > > at org.apache.nutch.crawl.Injector.inject(Injector.java:442) > > at org.apache.nutch.crawl.Injector.inject(Injector.java:365) > > at org.apache.nutch.crawl.Injector.inject(Injector.java:360) > > at org.apache.nutch.crawl.Crawl.run(Crawl.java:249) > > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) > > at org.apache.nutch.crawl.Crawl.main(Crawl.java:146) > > > > > > Thanks and Regards > > Raj Chidara
Nutch Exception
Hi, Nutch 1.19 compiled with ant without any errors, but when running the Injector I am getting this error: 19:20:25.055 [main] ERROR org.apache.nutch.crawl.Injector - Injector job did not succeed, job id: job_local952809651_0001, job status: FAILED, reason: NA Exception in thread "main" java.lang.RuntimeException: Injector job did not succeed, job id: job_local952809651_0001, job status: FAILED, reason: NA at org.apache.nutch.crawl.Injector.inject(Injector.java:442) at org.apache.nutch.crawl.Injector.inject(Injector.java:365) at org.apache.nutch.crawl.Injector.inject(Injector.java:360) at org.apache.nutch.crawl.Crawl.run(Crawl.java:249) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81) at org.apache.nutch.crawl.Crawl.main(Crawl.java:146) Thanks and Regards Raj Chidara
Nutch 1.19 in eclipse
Hi, I am following the instructions given here to run Nutch 1.19 in Eclipse (2022-03) with Java version 11.0.19: https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-Beforeyoustart The created project is giving build errors: The package org.w3c.dom is accessible from more than one module: <unnamed>, java.xml and I am not able to continue with the next steps. Please help me with this problem. Thanks and Regards Raj Chidara
Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison
Thank you, all! I’m thrilled to join the team! On Thu, Jul 20, 2023 at 9:42 AM Julien Nioche wrote: > What a fantastic addition to the Nutch team! Congrats to Tim > > On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel wrote: > >> Dear all, >> >> It is my pleasure to announce that Tim Allison has joined us >> as a committer and member of the Nutch PMC. >> >> You may already know Tim as a maintainer of and contributor to >> Apache Tika. So, it was great to see contributions to the >> Nutch source code from an experienced developer who is also >> active in a related Apache project. Among other contributions >> Tim recently implemented the indexer-opensearch plugin. >> >> Thank you, Tim Allison, and congratulations on your new role >> in the Apache Nutch community! And welcome on board! >> >> Sebastian >> (on behalf of the Nutch PMC) > > >> > > -- > > *Open Source Solutions for Text Engineering* > > http://www.digitalpebble.com > http://digitalpebble.blogspot.com/ > #digitalpebble <http://twitter.com/digitalpebble> >
Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison
What a fantastic addition to the Nutch team! Congrats to Tim On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel wrote: > Dear all, > > It is my pleasure to announce that Tim Allison has joined us > as a committer and member of the Nutch PMC. > > You may already know Tim as a maintainer of and contributor to > Apache Tika. So, it was great to see contributions to the > Nutch source code from an experienced developer who is also > active in a related Apache project. Among other contributions > Tim recently implemented the indexer-opensearch plugin. > > Thank you, Tim Allison, and congratulations on your new role > in the Apache Nutch community! And welcome on board! > > Sebastian > (on behalf of the Nutch PMC) > -- *Open Source Solutions for Text Engineering* http://www.digitalpebble.com http://digitalpebble.blogspot.com/ #digitalpebble <http://twitter.com/digitalpebble>
[ANNOUNCE] New Nutch committer and PMC - Tim Allison
Dear all, It is my pleasure to announce that Tim Allison has joined us as a committer and member of the Nutch PMC. You may already know Tim as a maintainer of and contributor to Apache Tika. So, it was great to see contributions to the Nutch source code from an experienced developer who is also active in a related Apache project. Among other contributions Tim recently implemented the indexer-opensearch plugin. Thank you, Tim Allison, and congratulations on your new role in the Apache Nutch community! And welcome on board! Sebastian (on behalf of the Nutch PMC)
Re: Nutch 1.19 Getting Error: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
Hi Eric, unfortunately, on Windows you also need to download and install winutils.exe and hadoop.dll, see https://github.com/cdarlint/winutils and https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io The installation of Hadoop is not mandatory - the Nutch binary package already includes Hadoop jar files. Alternatively, you may prefer to run Nutch on Linux - no additional installations required. Best, Sebastian On 5/15/23 04:07, Eric Valencia wrote: Hello everyone, So, I set up Nutch 1.19, Solr 8.11.2, and hadoop 3.3.5, to the best of my knowledge. After, I went into the nutch directory and ran this command: *bin/nutch generate crawl/crawldb crawl/segments* Then, I got an error: *Exception in thread "main" java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'* Does anyone know how to solve this problem? Below is the full output: $ bin/nutch generate crawl/crawldb crawl/segments SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. 
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2023-05-14 19:01:16,433 INFO o.a.n.p.PluginManifestParser [main] Plugins: looking in: C:\Users\User\Desktop\wiki\a\ApacheNutch\apache-nutch-1.19\plugins 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true] 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Registered Plugins: 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]Regex URL Filter (urlfilter-regex) 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]Html Parse Plug-in (parse-html) 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]HTTP Framework (lib-http) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] the nutch core extension points (nutch-extensionpoints) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Basic Indexing Filter (index-basic) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Anchor Indexing Filter (index-anchor) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Tika Parser Plug-in (parse-tika) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Basic URL Normalizer (urlnormalizer-basic) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Regex URL Filter Framework (lib-regex-filter) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Regex URL Normalizer (urlnormalizer-regex) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]URL Validator (urlfilter-validator) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]CyberNeko HTML Parser (lib-nekohtml) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]OPIC Scoring Plug-in (scoring-opic) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] Pass-through URL Normalizer (urlnormalizer-pass) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Http Protocol Plug-in (protocol-http) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] SolrIndexWriter (indexer-solr) 2023-05-14 
19:01:16,559 INFO o.a.n.p.PluginRepository [main] Registered Extension-Points: 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] (Nutch Content Parser) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL Filter) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (HTML Parse Filter) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Scoring) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL Normalizer) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Publisher) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Exchange) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Protocol) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL Ignore Exemption Filter) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Index Writer) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Segment Merge Filter) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Indexing Filter) 2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: starting at 2023-05-14 19:01:16 2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: Selecting best-scoring urls due for fetch. 2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: filtering: true 2023-0
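The winutils setup Sebastian describes above usually amounts to placing the binaries where Hadoop's native code loader can find them. A sketch for Windows — the install path is hypothetical; pick the winutils build that matches your Hadoop version:

```
:: Windows (cmd) -- hypothetical locations, adjust to your setup.
:: 1. Download winutils.exe and hadoop.dll for your Hadoop version
::    from https://github.com/cdarlint/winutils into C:\hadoop\bin
set HADOOP_HOME=C:\hadoop
set PATH=%HADOOP_HOME%\bin;%PATH%
:: Some setups additionally require copying hadoop.dll to C:\Windows\System32.
```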
Nutch 1.19 Getting Error: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'
Hello everyone, So, I set up Nutch 1.19, Solr 8.11.2, and hadoop 3.3.5, to the best of my knowledge. After, I went into the nutch directory and ran this command: *bin/nutch generate crawl/crawldb crawl/segments* Then, I got an error: *Exception in thread "main" java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'* Does anyone know how to solve this problem? Below is the full output: $ bin/nutch generate crawl/crawldb crawl/segments SLF4J: Class path contains multiple SLF4J bindings. SLF4J: Found binding in [jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: Found binding in [jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation. SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory] 2023-05-14 19:01:16,433 INFO o.a.n.p.PluginManifestParser [main] Plugins: looking in: C:\Users\User\Desktop\wiki\a\ApacheNutch\apache-nutch-1.19\plugins 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true] 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Registered Plugins: 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]Regex URL Filter (urlfilter-regex) 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]Html Parse Plug-in (parse-html) 2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]HTTP Framework (lib-http) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] the nutch core extension points (nutch-extensionpoints) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Basic Indexing Filter (index-basic) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Anchor Indexing Filter (index-anchor) 2023-05-14 19:01:16,559 INFO 
o.a.n.p.PluginRepository [main]Tika Parser Plug-in (parse-tika) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Basic URL Normalizer (urlnormalizer-basic) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Regex URL Filter Framework (lib-regex-filter) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Regex URL Normalizer (urlnormalizer-regex) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]URL Validator (urlfilter-validator) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]CyberNeko HTML Parser (lib-nekohtml) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]OPIC Scoring Plug-in (scoring-opic) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] Pass-through URL Normalizer (urlnormalizer-pass) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]Http Protocol Plug-in (protocol-http) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] SolrIndexWriter (indexer-solr) 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] Registered Extension-Points: 2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] (Nutch Content Parser) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL Filter) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (HTML Parse Filter) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Scoring) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL Normalizer) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Publisher) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Exchange) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Protocol) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch URL Ignore Exemption Filter) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Index Writer) 2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main] (Nutch Segment Merge Filter) 2023-05-14 19:01:16,560 INFO 
o.a.n.p.PluginRepository [main] (Nutch Indexing Filter) 2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: starting at 2023-05-14 19:01:16 2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: Selecting best-scoring urls due for fetch. 2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: filtering: true 2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: normalizing: true 2023-05-14 19:01:16,974 INFO o.a.n.c.Generator [main] Generator: running in local mode, generating exactly one partition. Exception in thread "main" java.lang.UnsatisfiedLinkError: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)' at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native Method) at org.apache.hadoop.io.nativeio.NativeIO$Windows.acce
Re: Nutch 1.19/Hadoop compatible
Hello Mike, > Is nutch 1.19 compatible with Hadoop 3.3.4? Yes! Regards, Markus Op di 7 mrt 2023 om 17:37 schreef Mike : > Hello! > > Is nutch 1.19 compatible with Hadoop 3.3.4? > > > Thanks! > > mike >
Nutch 1.19/Hadoop compatible
Hello! Is nutch 1.19 compatible with Hadoop 3.3.4? Thanks! mike
Re: Configuration Nutch in cluster mode
Hello Sebastian! I have now installed Hadoop; unfortunately there are problems. I will write a post about it. Thanks Mike On Tue, Jan 17, 2023 at 09:49 Sebastian Nagel wrote: > Hi Mike, > > the Nutch configuration files are included in the job file found in > runtime/deploy after the build. This means you need to compile Nutch yourself > if used in "distributed" mode. > > For exercising, you can first work in "pseudo-distributed" mode, i.e. > on a single-node Hadoop cluster. All commands are the same as in fully > distributed mode. > > If it helps, I prepared some setup scripts to run Nutch in > pseudo-distributed mode: >https://github.com/sebastian-nagel/nutch-test-single-node-cluster > > Best, > Sebastian > > On 1/15/23 04:26, Mike wrote: > > I will now try to configure the bot URL etc. before the build, > > but how and where do I configure between the crawls, e.g. the number of pages > > per host? > > > > Where do I configure Nutch in cluster mode? > > > > thx, mike > > >
Re: Configuration Nutch in cluster mode
Hi Mike, the Nutch configuration files are included in the job file found in runtime/deploy after the build. This means you need to compile Nutch yourself if used in "distributed" mode. For exercising, you can first work in "pseudo-distributed" mode, i.e. on a single-node Hadoop cluster. All commands are the same as in fully distributed mode. If it helps, I prepared some setup scripts to run Nutch in pseudo-distributed mode: https://github.com/sebastian-nagel/nutch-test-single-node-cluster Best, Sebastian On 1/15/23 04:26, Mike wrote: I will now try to configure the bot URL etc. before the build, but how and where do I configure between the crawls, e.g. the number of pages per host? Where do I configure Nutch in cluster mode? thx, mike
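The build-and-run flow Sebastian describes can be sketched as follows; the exact version in the .job file name depends on the checkout, so treat the paths as illustrative.

```shell
# Build Nutch from source; the Hadoop job file lands in runtime/deploy/.
ant clean runtime
ls runtime/deploy              # apache-nutch-1.*.job plus bin/
# Run Nutch tools through the deploy wrapper; it submits the .job file
# to the Hadoop cluster configured in HADOOP_CONF_DIR. The commands are
# the same in pseudo-distributed and fully distributed mode.
runtime/deploy/bin/nutch inject crawl/crawldb urls/
```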
Re: Nutch/Hadoop Cluster
Hi Mike, > It can be tedious to set up for the first time, and there are many components. In case you prefer Linux packages, I can recommend Apache Bigtop, see https://bigtop.apache.org/ and for the list of package repositories https://downloads.apache.org/bigtop/stable/repos/ ~Sebastian On 1/15/23 01:06, Markus Jelsma wrote: Hello Mike, would it pay off for me to put a hadoop cluster on top of the 3 servers. Yes, for as many reasons as Hadoop exists for. It can be tedious to set up for the first time, and there are many components. But at least you have three servers, which is more or less required by ZooKeeper, which you will also need. Ideally you would have some additional VMs to run the controlling Hadoop programs and perhaps the Hadoop client nodes on. The workers can run on bare metal. 1.) a server would not be integrated directly into the crawl process as a master. What do you mean? Can you elaborate? 2.) can I run multiple crawl jobs on one server? Yes! Just have separate instances of Nutch home dirs on your Hadoop client nodes, each having their own configuration. Regards, Markus On Sat, Jan 14, 2023 at 18:42 Mike wrote: Hi! I am now crawling the internet in local mode in parallel with up to 10 instances on 3 computers. would it pay off for me to put a hadoop cluster on top of the 3 servers. 1.) a server would not be integrated directly into the crawl process as a master. 2.) can I run multiple crawl jobs on one server? Thanks
Configuration Nutch in cluster mode
I will now try to configure the bot url etc. before the building, but how and where do I configure between the crawls e.g. number of pages per host? where do I configure nutch in cluster mode? thx, mike
Re: Nutch/Hadoop Cluster
Hello Mike, > would it pay off for me to put a hadoop cluster on top of the 3 servers. Yes, for as many reasons as Hadoop exists for. It can be tedious to set up for the first time, and there are many components. But at least you have three servers, which is more or less required by ZooKeeper, which you will also need. Ideally you would have some additional VMs to run the controlling Hadoop programs and perhaps the Hadoop client nodes on. The workers can run on bare metal. > 1.) a server would not be integrated directly into the crawl process as a master. What do you mean? Can you elaborate? > 2.) can I run multiple crawl jobs on one server? Yes! Just have separate instances of Nutch home dirs on your Hadoop client nodes, each having their own configuration. Regards, Markus On Sat, Jan 14, 2023 at 18:42 Mike wrote: > Hi! > > I am now crawling the internet in local mode in parallel with up to 10 > instances on 3 computers. would it pay off for me to put a hadoop cluster > on top of the 3 servers. > > 1.) a server would not be integrated directly into the crawl process as a > master. > 2.) can I run multiple crawl jobs on one server? > > Thanks >
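Markus's suggestion of separate Nutch home directories per crawl job might look like the sketch below; the directory names, seed paths, and crawl arguments are illustrative assumptions, not from the thread.

```shell
# Two independent crawl jobs from one client node, each with its own conf/.
cp -r apache-nutch-1.19 nutch-news
cp -r apache-nutch-1.19 nutch-shops
# Edit nutch-news/conf/nutch-site.xml and nutch-shops/conf/nutch-site.xml
# separately (http.agent.name, fetch limits, ...), then launch both crawls:
(cd nutch-news  && bin/crawl -i -s seeds/ crawl-news  2) &
(cd nutch-shops && bin/crawl -i -s seeds/ crawl-shops 2) &
wait
```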
Nutch/Hadoop Cluster
Hi! I am now crawling the internet in local mode in parallel with up to 10 instances on 3 computers. would it pay off for me to put a hadoop cluster on top of the 3 servers. 1.) a server would not be integrated directly into the crawl process as a master. 2.) can I run multiple crawl jobs on one server? Thanks
Re: Nutch/Hadoop: Error (FreeGenerator job did not succeed)
Hello, You cannot just run Nutch's JAR like that on Hadoop, you need the large .job file instead. If you build Nutch from source, you will get a runtime/deploy directory. Upload its contents to a Hadoop client and run Nutch commands using bin/nutch ... You will then automatically use the large .job file that is on the same level as the bin directory. Application log files on Hadoop are spread across the cluster: select individual mapper or reducer subtasks, click through, and inspect their logs there. Good luck! Markus On Fri, Oct 14, 2022 at 16:18 Mike wrote: > Hi! > > I've been using Nutch for a while but I'm new to Hadoop. I got a cluster with > Hadoop 3.2.3 installed. > > Do I have to install Nutch on the Hadoop filesystem or can I run it > "local"? The clients don't need more from Nutch than the info on the master in > the command line: hadoop jar /home/debian/nutch40/lib/apache-nutch-1.19.jar > org.apache.nutch.tools.FreeGenerator -conf /home/debian/nutch40/conf/nutch-default.xml > -Dplugin.folder=/home/debian/nutch40/plugins/ > /crawl/urls//tranco-top350k-20221007.txt /home/debian/crawl/segments/ > > I get an error on the command: > > Exception in thread "main" java.lang.RuntimeException: FreeGenerator job > did not succeed, job id: job_1665751705815_0007, job status: FAILED, > reason: Task failed task_1665751705815_0007_m_00 > > > Since I'm new I can't find the logs in Hadoop properly yet. > > Is there a guide on how to install Nutch (1.19) on Hadoop that I can't find? > > Thanks > Mike >
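Markus's advice rendered as commands, hedged: the seed path is taken from Mike's command, everything else assumes a standard Nutch source checkout.

```shell
# Build the deployable artifacts; runtime/deploy/ contains bin/ and the
# .job file that bin/nutch submits to the cluster automatically.
ant clean runtime
cd runtime/deploy
# FreeGenerator is exposed as the `freegen` command; with Hadoop on the
# PATH and HADOOP_CONF_DIR set, this runs on the cluster:
bin/nutch freegen /crawl/urls/tranco-top350k-20221007.txt /crawl/segments
```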
Nutch/Hadoop: Error (FreeGenerator job did not succeed)
Hi! I've been using Nutch for a while but I'm new to Hadoop. I got a cluster with Hadoop 3.2.3 installed. Do I have to install Nutch on the Hadoop filesystem or can I run it "local"? The clients don't need more from Nutch than the info on the master in the command line: hadoop jar /home/debian/nutch40/lib/apache-nutch-1.19.jar org.apache.nutch.tools.FreeGenerator -conf /home/debian/nutch40/conf/nutch-default.xml -Dplugin.folder=/home/debian/nutch40/plugins/ /crawl/urls//tranco-top350k-20221007.txt /home/debian/crawl/segments/ I get an error on the command: Exception in thread "main" java.lang.RuntimeException: FreeGenerator job did not succeed, job id: job_1665751705815_0007, job status: FAILED, reason: Task failed task_1665751705815_0007_m_00 Since I'm new I can't find the logs in Hadoop properly yet. Is there a guide on how to install Nutch (1.19) on Hadoop that I can't find? Thanks Mike
[ANNOUNCE] Apache Nutch 1.19 Release
The Apache Nutch team is pleased to announce the release of Apache Nutch v1.19. Nutch is a mature, production-ready web crawler. Nutch 1.x enables fine-grained configuration, relying on Apache Hadoop™ data structures. Source and binary distributions are available for download from the Apache Nutch download site: https://nutch.apache.org/downloads.html Please verify signatures using the KEYS file available at the above location when downloading the release. This release includes more than 80 bug fixes and improvements; the full list of changes can be seen in the release report https://s.apache.org/lf6li Please also check the changelog for breaking changes: https://apache.org/dist/nutch/1.19/CHANGES.txt Important changes are: - Nutch builds on JDK 11 - protocol plugins can provide a custom URL stream handler to support custom URL schemes, e.g. smb:// Notable dependency upgrades include: Hadoop 3.3.4, Solr 8.11.2, Tika 2.3.0. Thanks to everyone who contributed to this release!
[RESULT] was [VOTE] Release Apache Nutch 1.19 RC#1
Hi Folks, thanks to everyone who was able to review the release candidate! 72 hours have definitely passed, please see below for vote results. [4] +1 Release this package as Apache Nutch 1.19 Markus Jelsma * BlackIce * Jorge Betancourt * Sebastian Nagel * [0] -1 Do not release this package because ... * Nutch PMC The VOTE passes with 4 binding votes from Nutch PMC members. I'll continue to publish the release packages and announce the release. Thanks to everyone who contributed to Nutch and the 1.19 release. Sebastian On 8/22/22 17:30, Sebastian Nagel wrote: > Hi Folks, > > A first candidate for the Nutch 1.19 release is available at: > >https://dist.apache.org/repos/dist/dev/nutch/1.19/ > > The release candidate is a zip and tar.gz archive of the binary and sources > in: > https://github.com/apache/nutch/tree/release-1.19 > > In addition, a staged maven repository is available here: >https://repository.apache.org/content/repositories/orgapachenutch-1020 > > We addressed 87 issues: >https://s.apache.org/lf6li > > > Please vote on releasing this package as Apache Nutch 1.19. > The vote is open for the next 72 hours and passes if a majority > of at least three +1 Nutch PMC votes are cast. > > [ ] +1 Release this package as Apache Nutch 1.19. > [ ] -1 Do not release this package because… > > Cheers, > Sebastian > (On behalf of the Nutch PMC) > > P.S. > Here is my +1. > - tested most of Nutch tools and run a test crawl on a single-node cluster > running Hadoop 3.3.4, see > https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
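For anyone reviewing a release candidate like the one above, the standard ASF verification steps look roughly as follows; the artifact file names are assumptions based on the usual Nutch naming, not copied from the thread.

```shell
# Fetch the source artifact together with its signature and checksum.
wget https://dist.apache.org/repos/dist/dev/nutch/1.19/apache-nutch-1.19-src.tar.gz
wget https://dist.apache.org/repos/dist/dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.asc
wget https://dist.apache.org/repos/dist/dev/nutch/1.19/apache-nutch-1.19-src.tar.gz.sha512
# Import the release managers' keys, then verify signature and checksum.
wget -qO- https://downloads.apache.org/nutch/KEYS | gpg --import
gpg --verify apache-nutch-1.19-src.tar.gz.asc
sha512sum -c apache-nutch-1.19-src.tar.gz.sha512
```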
Re: Nutch 1.19 schema.xml
Hi Mike, I think there shouldn't be any issues upgrading the new schema.xml into the Solr core holding the index filled from Nutch, with maybe two exceptions: - index-geoip is used (then some field definitions may change) - an older Solr version is used (e.g. one not yet supporting solr.LatLonPointSpatialField) If in doubt, I'd run a test first to be sure that the production system isn't broken. Best, Sebastian On 9/4/22 18:08, Mike wrote: > Hello Sebastian! > > Thanks for your answer! > Is it possible to simply update the schema.xml file without re-indexing? > > Thanks > Mike > > On Fri, Sep 2, 2022 at 13:25 Sebastian Nagel wrote: > >> Hi Mike, >> >> the Nutch/Solr schema.xml will be updated with the release of 1.19 >> (expected >> soon, a vote about RC#1 is ongoing): >> [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType >> [NUTCH-2957] - add fall-back field definitions for unknown index fields >> [NUTCH-2956] - typos in field names filled by index-geoip >> >> See the commits on the schema.xml >> >> https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml >> >> Best, >> Sebastian >> >> >> On 8/31/22 14:02, Mike wrote: >>> Hello! >>> >>> >>> Will the schema.xml stay the same in Nutch 1.19? >>> >>> thanks! >>> >>> mike >>> >> >
Re: Nutch 1.19 schema.xml
Hello Sebastian! Thanks for your answer! Is it possible to simply update the schema.xml file without re-indexing? Thanks Mike On Fri, Sep 2, 2022 at 13:25 Sebastian Nagel wrote: > Hi Mike, > > the Nutch/Solr schema.xml will be updated with the release of 1.19 > (expected > soon, a vote about RC#1 is ongoing): > [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType > [NUTCH-2957] - add fall-back field definitions for unknown index fields > [NUTCH-2956] - typos in field names filled by index-geoip > > See the commits on the schema.xml > > https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml > > Best, > Sebastian > > > On 8/31/22 14:02, Mike wrote: > > Hello! > > > > > > Will the schema.xml stay the same in Nutch 1.19? > > > > thanks! > > > > mike > > >
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Hi Markus, thanks! Could you share the files in .ivy2/cache/org.apache.httpcomponents/httpasyncclient/ and maybe also the logs of a Nutch build starting with an empty ~/.ivy2/cache? I'll have a look and compare it with what I find on my system. Maybe use a new thread on user@ or a Jira issue; I plan to close the vote over the weekend, so let's keep this thread for the release vote alone. Best, Sebastian On 8/29/22 14:17, Markus Jelsma wrote: > Hello Sebastian, > > No, the JAR isn't present. Multiple JARs are missing, probably because they > are loaded after httpasyncclient. I checked the previously emptied Ivy > cache. The Ivy files are there, but the JAR is missing there too. > > markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/ > ivy-4.1.4.xml ivy-4.1.4.xml.original ivydata-4.1.4.properties > > I manually downloaded the JAR from [1] and added it to the jars/ directory > in the Ivy cache. It still cannot find the JAR; perhaps the Ivy cache needs > some more things than just adding the JAR manually. > > The odd thing is that I got the URL below FROM the ivydata-4.1.4.properties > file in the cache. > > Since Ralf can compile it without problems, it seems to be an issue on my > machine only. So Nutch seems fine, therefore +1. > > Regards, > Markus > > [1] > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/ > > > On Sun, Aug 28, 2022 at 12:05 Sebastian Nagel wrote: > >> Hi Ralf, >> >>> It fetches it parses >> >> So a +1 ? >> >> Best, >> Sebastian >> >> On 8/25/22 05:22, BlackIce wrote: >>> nevermind I made a typo... >>> >>> It fetches it parses >>> >>> On Thu, Aug 25, 2022 at 3:42 AM BlackIce wrote: >>>> >>>> so far... 
it doesn't select anything when creating segments: >>>> 0 records selected for fetching, exiting >>>> >>>> On Wed, Aug 24, 2022 at 3:02 PM BlackIce wrote: >>>>> >>>>> I have been able to compile under OpenJDK 11 >>>>> Have not done anything further so far >>>>> I'm gonna try to get to it this evening >>>>> >>>>> Greetz >>>>> Ralf >>>>> >>>>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma >>>>> wrote: >>>>>> >>>>>> Hi, >>>>>> >>>>>> Everything seems fine, the crawler seems fine when trying the binary >>>>>> distribution. The source won't work because this computer still cannot >>>>>> compile it. Clearing the local Ivy cache did not do much. This is the >> known >>>>>> compiler error with the elastic-indexer plugin: >>>>>> compile: >>>>>> [echo] Compiling plugin: indexer-elastic >>>>>>[javac] Compiling 3 source files to >>>>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes >>>>>>[javac] >>>>>> >> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: >>>>>> error: package org.apache.http.impl.nio.client does not exist >>>>>>[javac] import >> org.apache.http.impl.nio.client.HttpAsyncClientBuilder; >>>>>>[javac] ^ >>>>>>[javac] 1 error >>>>>> >>>>>> >>>>>> The binary distribution works fine though. I do see a lot of new >> messages >>>>>> when fetching: >>>>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters >> [LocalJobRunner >>>>>> Map Task Executor #0] Found 0 extensions at >>>>>> point:'org.apache.nutch.net.URLExemptionFilter' >>>>>> >>>>>> This is also new at start of each task: >>>>>> SLF4J: Class path contains multiple SLF4J bindings. 
>>>>>> SLF4J: Found binding in >>>>>> >> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] >>>>>> >>>>>> SLF4J: Found binding in >>>>>> >> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] >>>>>> >>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an >>>>>> explanation. >>>>>> SLF4
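A hedged sketch of the Ivy-cache repair discussed in this thread: rather than dropping the JAR into the cache by hand, remove the stale module entry so Ivy re-resolves it on the next build.

```shell
# Remove only the broken module entry (ivy-*.xml present, jars/ missing),
# then rebuild so Ivy downloads httpasyncclient again.
rm -rf ~/.ivy2/cache/org.apache.httpcomponents/httpasyncclient
ant clean runtime
```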
Re: Nutch 1.19 schema.xml
Hi Mike, the Nutch/Solr schema.xml will be updated with the release of 1.19 (expected soon, a vote about RC#1 is ongoing): [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType [NUTCH-2957] - add fall-back field definitions for unknown index fields [NUTCH-2956] - typos in field names filled by index-geoip See the commits on the schema.xml https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml Best, Sebastian On 8/31/22 14:02, Mike wrote: > Hello! > > > Will the schema.xml stay the same in Nutch 1.19? > > thanks! > > mike >
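Sebastian's "upgrade the schema.xml in the Solr core" step could be sketched as below; the Solr home path and the core name "nutch" are assumptions for illustration, so adjust them to your installation.

```shell
# Copy the updated schema shipped with Nutch into the existing core's
# conf/ directory, then reload the core so Solr picks it up.
cp src/plugin/indexer-solr/schema.xml /var/solr/data/nutch/conf/schema.xml
curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=nutch"
```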
Nutch 1.19 schema.xml
Hello! Will the schema.xml stay the same in Nutch 1.19? thanks! mike
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Hi all, Compiled from the sources (JDK11) and ran a small crawl and indexing (to Solr) both passed with flying colors. That's a +1 from me. Great work Sebastian! On Mon, Aug 22, 2022 at 5:30 PM Sebastian Nagel wrote: > Hi Folks, > > A first candidate for the Nutch 1.19 release is available at: > >https://dist.apache.org/repos/dist/dev/nutch/1.19/ > > The release candidate is a zip and tar.gz archive of the binary and > sources in: > https://github.com/apache/nutch/tree/release-1.19 > > In addition, a staged maven repository is available here: >https://repository.apache.org/content/repositories/orgapachenutch-1020 > > We addressed 87 issues: >https://s.apache.org/lf6li > > > Please vote on releasing this package as Apache Nutch 1.19. > The vote is open for the next 72 hours and passes if a majority > of at least three +1 Nutch PMC votes are cast. > > [ ] +1 Release this package as Apache Nutch 1.19. > [ ] -1 Do not release this package because… > > Cheers, > Sebastian > (On behalf of the Nutch PMC) > > P.S. > Here is my +1. > - tested most of Nutch tools and run a test crawl on a single-node cluster > running Hadoop 3.3.4, see > https://github.com/sebastian-nagel/nutch-test-single-node-cluster/) >
Re: [VOTE] Release Apache Nutch 1.19 RC#1
OK, I compiled Nutch under JDK11 Did some basic fetching, parsing, linkinversion and posterior indexing to Solr 9 [+1] Great work! RRK On Tue, Aug 30, 2022 at 12:22 PM BlackIce wrote: > > Tried some indexing... but when manually doing "Invertilinks" it says > something about input path does not exist. > Has invertilinks changed since 1.18? > > Greetz > RRK > > On Mon, Aug 29, 2022 at 3:38 PM BlackIce wrote: > > > > Haven't indexed anything to solr.. gonna give it a shot in a few hours > > > > On Mon, Aug 29, 2022 at 2:17 PM Markus Jelsma > > wrote: > > > > > > Hello Sebastian, > > > > > > No, the JAR isn't present. Multiple JARs are missing, probably because > > > they > > > are loaded after httpasyncclient. I checked the previously emptied Ivy > > > cache. The Ivy files are there, but the JAR is missing there too. > > > > > > markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/ > > > ivy-4.1.4.xml ivy-4.1.4.xml.original ivydata-4.1.4.properties > > > > > > I manually downloaded the JAR from [1] and added it to the jars/ directory > > > in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache > > > needs > > > some more things than just adding the JAR manually. > > > > > > The odd thing is, that i got the URL below FROM the > > > ivydata-4.1.4.properties > > > file in the cache. > > > > > > Since Ralf can compile it without problems, it seems to be an issue on my > > > machine only. So Nutch seems fine, therefore +1. > > > > > > Regards, > > > Markus > > > > > > [1] > > > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/ > > > > > > > > > Op zo 28 aug. 2022 om 12:05 schreef Sebastian Nagel > > > : > > > > > > > Hi Ralf, > > > > > > > > > It fetches it parses > > > > > > > > So a +1 ? > > > > > > > > Best, > > > > Sebastian > > > > > > > > On 8/25/22 05:22, BlackIce wrote: > > > > > nevermind I made a typo... 
> > > > > > > > > > It fetches it parses > > > > > > > > > > On Thu, Aug 25, 2022 at 3:42 AM BlackIce > > > > > wrote: > > > > >> > > > > >> so far... it doesn't select anything when creating segments: > > > > >> 0 records selected for fetching, exiting > > > > >> > > > > >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce > > > > >> wrote: > > > > >>> > > > > >>> I have been able to compile under OpenJDK 11 > > > > >>> Have not done anything further so far > > > > >>> I'm gonna try to get to it this evening > > > > >>> > > > > >>> Greetz > > > > >>> Ralf > > > > >>> > > > > >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma > > > > >>> wrote: > > > > >>>> > > > > >>>> Hi, > > > > >>>> > > > > >>>> Everything seems fine, the crawler seems fine when trying the > > > > >>>> binary > > > > >>>> distribution. The source won't work because this computer still > > > > >>>> cannot > > > > >>>> compile it. Clearing the local Ivy cache did not do much. This is > > > > >>>> the > > > > known > > > > >>>> compiler error with the elastic-indexer plugin: > > > > >>>> compile: > > > > >>>> [echo] Compiling plugin: indexer-elastic > > > > >>>>[javac] Compiling 3 source files to > > > > >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes > > > > >>>>[javac] > > > > >>>> > > > > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: > > > > >>>> error: package org.apache.http.impl.nio.client does not exist > > > > >>>>[javac] import > > > > org.apache.http.impl.nio.client.HttpAsyncClientBuilder; > > > > >>>>[javac]
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Tried some indexing... but when manually doing "Invertilinks" it says something about input path does not exist. Has invertilinks changed since 1.18? Greetz RRK On Mon, Aug 29, 2022 at 3:38 PM BlackIce wrote: > > Haven't indexed anything to solr.. gonna give it a shot in a few hours > > On Mon, Aug 29, 2022 at 2:17 PM Markus Jelsma > wrote: > > > > Hello Sebastian, > > > > No, the JAR isn't present. Multiple JARs are missing, probably because they > > are loaded after httpasyncclient. I checked the previously emptied Ivy > > cache. The Ivy files are there, but the JAR is missing there too. > > > > markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/ > > ivy-4.1.4.xml ivy-4.1.4.xml.original ivydata-4.1.4.properties > > > > I manually downloaded the JAR from [1] and added it to the jars/ directory > > in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache needs > > some more things than just adding the JAR manually. > > > > The odd thing is, that i got the URL below FROM the ivydata-4.1.4.properties > > file in the cache. > > > > Since Ralf can compile it without problems, it seems to be an issue on my > > machine only. So Nutch seems fine, therefore +1. > > > > Regards, > > Markus > > > > [1] > > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/ > > > > > > Op zo 28 aug. 2022 om 12:05 schreef Sebastian Nagel > > : > > > > > Hi Ralf, > > > > > > > It fetches it parses > > > > > > So a +1 ? > > > > > > Best, > > > Sebastian > > > > > > On 8/25/22 05:22, BlackIce wrote: > > > > nevermind I made a typo... > > > > > > > > It fetches it parses > > > > > > > > On Thu, Aug 25, 2022 at 3:42 AM BlackIce wrote: > > > >> > > > >> so far... 
it doesn't select anything when creating segments: > > > >> 0 records selected for fetching, exiting > > > >> > > > >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce wrote: > > > >>> > > > >>> I have been able to compile under OpenJDK 11 > > > >>> Have not done anything further so far > > > >>> I'm gonna try to get to it this evening > > > >>> > > > >>> Greetz > > > >>> Ralf > > > >>> > > > >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma > > > >>> wrote: > > > >>>> > > > >>>> Hi, > > > >>>> > > > >>>> Everything seems fine, the crawler seems fine when trying the binary > > > >>>> distribution. The source won't work because this computer still > > > >>>> cannot > > > >>>> compile it. Clearing the local Ivy cache did not do much. This is the > > > known > > > >>>> compiler error with the elastic-indexer plugin: > > > >>>> compile: > > > >>>> [echo] Compiling plugin: indexer-elastic > > > >>>>[javac] Compiling 3 source files to > > > >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes > > > >>>>[javac] > > > >>>> > > > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: > > > >>>> error: package org.apache.http.impl.nio.client does not exist > > > >>>>[javac] import > > > org.apache.http.impl.nio.client.HttpAsyncClientBuilder; > > > >>>>[javac] ^ > > > >>>>[javac] 1 error > > > >>>> > > > >>>> > > > >>>> The binary distribution works fine though. I do see a lot of new > > > messages > > > >>>> when fetching: > > > >>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters > > > [LocalJobRunner > > > >>>> Map Task Executor #0] Found 0 extensions at > > > >>>> point:'org.apache.nutch.net.URLExemptionFilter' > > > >>>> > > > >>>> This is also new at start of each task: > > > >>>> SLF4J: Class path contains multiple SLF4J bindings. > > > >>>> SLF4J: Found binding in > > > >>>
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Haven't indexed anything to solr.. gonna give it a shot in a few hours On Mon, Aug 29, 2022 at 2:17 PM Markus Jelsma wrote: > > Hello Sebastian, > > No, the JAR isn't present. Multiple JARs are missing, probably because they > are loaded after httpasyncclient. I checked the previously emptied Ivy > cache. The Ivy files are there, but the JAR is missing there too. > > markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/ > ivy-4.1.4.xml ivy-4.1.4.xml.original ivydata-4.1.4.properties > > I manually downloaded the JAR from [1] and added it to the jars/ directory > in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache needs > some more things than just adding the JAR manually. > > The odd thing is, that i got the URL below FROM the ivydata-4.1.4.properties > file in the cache. > > Since Ralf can compile it without problems, it seems to be an issue on my > machine only. So Nutch seems fine, therefore +1. > > Regards, > Markus > > [1] > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/ > > > Op zo 28 aug. 2022 om 12:05 schreef Sebastian Nagel > : > > > Hi Ralf, > > > > > It fetches it parses > > > > So a +1 ? > > > > Best, > > Sebastian > > > > On 8/25/22 05:22, BlackIce wrote: > > > nevermind I made a typo... > > > > > > It fetches it parses > > > > > > On Thu, Aug 25, 2022 at 3:42 AM BlackIce wrote: > > >> > > >> so far... it doesn't select anything when creating segments: > > >> 0 records selected for fetching, exiting > > >> > > >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce wrote: > > >>> > > >>> I have been able to compile under OpenJDK 11 > > >>> Have not done anything further so far > > >>> I'm gonna try to get to it this evening > > >>> > > >>> Greetz > > >>> Ralf > > >>> > > >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma > > >>> wrote: > > >>>> > > >>>> Hi, > > >>>> > > >>>> Everything seems fine, the crawler seems fine when trying the binary > > >>>> distribution. 
The source won't work because this computer still cannot > > >>>> compile it. Clearing the local Ivy cache did not do much. This is the > > known > > >>>> compiler error with the elastic-indexer plugin: > > >>>> compile: > > >>>> [echo] Compiling plugin: indexer-elastic > > >>>>[javac] Compiling 3 source files to > > >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes > > >>>>[javac] > > >>>> > > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: > > >>>> error: package org.apache.http.impl.nio.client does not exist > > >>>> [javac] import > > org.apache.http.impl.nio.client.HttpAsyncClientBuilder; > > >>>>[javac] ^ > > >>>>[javac] 1 error > > >>>> > > >>>> > > >>>> The binary distribution works fine though. I do see a lot of new > > messages > > >>>> when fetching: > > >>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters > > [LocalJobRunner > > >>>> Map Task Executor #0] Found 0 extensions at > > >>>> point:'org.apache.nutch.net.URLExemptionFilter' > > >>>> > > >>>> This is also new at start of each task: > > >>>> SLF4J: Class path contains multiple SLF4J bindings. > > >>>> SLF4J: Found binding in > > >>>> > > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class] > > >>>> > > >>>> SLF4J: Found binding in > > >>>> > > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class] > > >>>> > > >>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an > > >>>> explanation. > > >>>> SLF4J: Actual binding is of type > > >>>> [org.apache.logging.slf4j.Log4jLoggerFactory] > > >>>> > > >>>> And this one at the end of fetcher: > > >
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Hello Sebastian,

No, the JAR isn't present. Multiple JARs are missing, probably because they are loaded after httpasyncclient. I checked the previously emptied Ivy cache: the Ivy files are there, but the JAR is missing there too.

markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties

I manually downloaded the JAR from [1] and added it to the jars/ directory in the Ivy cache. It still cannot find the JAR; perhaps the Ivy cache needs more than just the JAR being added manually. The odd thing is that I got the URL below from the ivydata-4.1.4.properties file in the cache.

Since Ralf can compile it without problems, it seems to be an issue on my machine only. So Nutch seems fine, therefore +1.

Regards,
Markus

[1] https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/

On Sun, Aug 28, 2022 at 12:05, Sebastian Nagel wrote:
> Hi Ralf,
>
> > It fetches it parses
>
> So a +1 ?
>
> Best,
> Sebastian
> [earlier quoted messages trimmed]
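For readers hitting the same wall: Markus's symptom (Ivy metadata files present, but no jars/ subdirectory) can be spotted with a small check. This is only a sketch under assumptions — the default ~/.ivy2 cache layout from his listing, and `check_ivy_jar` is a name of my own; deleting the whole module directory and re-resolving is a common Ivy workaround, not something confirmed in this thread.

```shell
#!/bin/sh
# Sketch of a cache sanity check for the symptom in this thread: Ivy
# metadata present, but the jar itself missing from the cache.
check_ivy_jar() {
    # $1 = ivy home (e.g. ~/.ivy2), $2 = org, $3 = module, $4 = version
    jar="$1/cache/$2/$3/jars/$3-$4.jar"
    if [ -f "$jar" ]; then
        echo "jar present"
    else
        # Hand-placing the jar is often not enough: Ivy also tracks
        # resolution state in ivydata-*.properties. Removing the whole
        # module directory and re-running 'ant clean runtime' forces a
        # clean re-resolve (common workaround, not verified here).
        echo "jar missing"
    fi
}

check_ivy_jar "$HOME/.ivy2" org.apache.httpcomponents httpasyncclient 4.1.4
```

If this prints "jar missing" even after a build, removing `~/.ivy2/cache/org.apache.httpcomponents/httpasyncclient` entirely before rebuilding may be more reliable than copying the jar in by hand.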
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Hi Ralf,

> It fetches it parses

So a +1 ?

Best,
Sebastian

On 8/25/22 05:22, BlackIce wrote:
> nevermind I made a typo...
>
> It fetches it parses
> [earlier quoted messages trimmed]
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Hi Markus,

thanks! What's your (final) decision?

> [javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;

During the build the class should be provided in
build/plugins/indexer-elastic/httpasyncclient-4.1.4.jar
Could you verify whether this jar is there and whether it contains the class file? See also:
https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/

> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.

In order to fix it, we need to make the error reproducible, or figure out what the reason is.

Regarding the logging: we switched to Log4j 2.x (NUTCH-2915) while Hadoop now uses reload4j (HADOOP-18088 [1]). The logging configuration should be improved to avoid the warnings in local mode. In distributed mode, the logging configuration of the provided Hadoop takes over.

Best,
Sebastian

[1] https://issues.apache.org/jira/browse/HADOOP-18088

On 8/24/22 13:28, Markus Jelsma wrote:
> [earlier quoted messages trimmed]
Re: [VOTE] Release Apache Nutch 1.19 RC#1
nevermind I made a typo...

It fetches it parses

On Thu, Aug 25, 2022 at 3:42 AM BlackIce wrote:
> so far... it doesn't select anything when creating segments:
> 0 records selected for fetching, exiting
> [earlier quoted messages trimmed]
Re: [VOTE] Release Apache Nutch 1.19 RC#1
so far... it doesn't select anything when creating segments:
0 records selected for fetching, exiting

On Wed, Aug 24, 2022 at 3:02 PM BlackIce wrote:
> I have been able to compile under OpenJDK 11
> Have not done anything further so far
> I'm gonna try to get to it this evening
>
> Greetz
> Ralf
> [earlier quoted messages trimmed]
Re: [VOTE] Release Apache Nutch 1.19 RC#1
I have been able to compile under OpenJDK 11
Have not done anything further so far
I'm gonna try to get to it this evening

Greetz
Ralf

On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma wrote:
> [earlier quoted messages trimmed]
Re: [VOTE] Release Apache Nutch 1.19 RC#1
Hi,

Everything seems fine, the crawler seems fine when trying the binary distribution. The source won't work because this computer still cannot compile it. Clearing the local Ivy cache did not do much. This is the known compiler error with the elastic-indexer plugin:

compile:
    [echo] Compiling plugin: indexer-elastic
    [javac] Compiling 3 source files to /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
    [javac] /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39: error: package org.apache.http.impl.nio.client does not exist
    [javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
    [javac]        ^
    [javac] 1 error

The binary distribution works fine though. I do see a lot of new messages when fetching:

2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner Map Task Executor #0] Found 0 extensions at point:'org.apache.nutch.net.URLExemptionFilter'

This is also new at start of each task:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]

And this one at the end of fetcher:

log4j:WARN No appenders could be found for logger (org.apache.commons.httpclient.params.DefaultHttpParams).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

I am worried about the indexer-elastic plugin, maybe others have that problem too? Otherwise everything seems fine.

Markus

On Mon, Aug 22, 2022 at 17:30, Sebastian Nagel wrote:
> [earlier quoted messages trimmed]
[VOTE] Release Apache Nutch 1.19 RC#1
Hi Folks,

A first candidate for the Nutch 1.19 release is available at:

  https://dist.apache.org/repos/dist/dev/nutch/1.19/

The release candidate is a zip and tar.gz archive of the binary and sources in:

  https://github.com/apache/nutch/tree/release-1.19

In addition, a staged maven repository is available here:

  https://repository.apache.org/content/repositories/orgapachenutch-1020

We addressed 87 issues:

  https://s.apache.org/lf6li

Please vote on releasing this package as Apache Nutch 1.19. The vote is open for the next 72 hours and passes if a majority of at least three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.19.
[ ] -1 Do not release this package because…

Cheers,
Sebastian
(On behalf of the Nutch PMC)

P.S. Here is my +1.
- tested most of the Nutch tools and ran a test crawl on a single-node cluster running Hadoop 3.3.4, see https://github.com/sebastian-nagel/nutch-test-single-node-cluster/
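Voting on a release candidate starts with verifying the downloaded artifacts. A minimal sketch of the checksum half of that (the signature half additionally needs `gpg --verify` against the project's KEYS file); `check_artifact` is a helper name of my own, and it assumes the companion `.sha512` file uses sha512sum's "digest  filename" format — an Apache `.sha512` file may need reformatting first.

```shell
#!/bin/sh
# Sketch: verify a release artifact against its companion "<file>.sha512".
# Assumes the sha512 file is in sha512sum's "digest  filename" format.
check_artifact() {
    dir="$(dirname "$1")"
    # sha512sum -c must run in the file's directory so the name matches
    ( cd "$dir" && sha512sum -c "$(basename "$1").sha512" )
}
```

Usage would look like `check_artifact apache-nutch-1.19-src.tar.gz` after placing the artifact and its `.sha512` next to each other.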
Re: Question about Nutch plugins
Hi Rastko,

the description isn't really correct now, as NUTCH_HOME is supposed to point to the runtime:

- if the binary package is used: this is the base folder of the package, e.g. apache-nutch-1.18/
- if Nutch is built from source, you usually point NUTCH_HOME to runtime/local/ - the directory tree below this folder looks pretty much the same as the binary package

Older versions of Nutch didn't have this separation of source and runtime.

> I use nutch by just unzipping apache-nutch-1.17-bin.tar.gz)?

If you want to build your own plugin, I'd recommend starting with the Nutch source package, or even the current master by cloning the Nutch git repository.

As always for a community project: feel free to improve the tutorial; obviously it might be out of date.

Best,
Sebastian

On 7/23/22 13:28, Rastko.pavlovic wrote:
> [earlier quoted messages trimmed]
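Sebastian's distinction can be summed up in a tiny check: a source checkout has src/plugin/ (and build-plugin.xml, so plugins can be built there), while the binary package only ships a plugins/ directory with already-built jars. The function name is mine; the marker directories follow his description and Rastko's report.

```shell
#!/bin/sh
# Sketch: tell a Nutch source checkout from a binary package / runtime.
nutch_layout() {
    if [ -d "$1/src/plugin" ]; then
        # Plugin sources live here; after 'ant runtime' use runtime/local/
        echo "source checkout"
    elif [ -d "$1/plugins" ]; then
        # Only built plugin jars; no src/plugin, so the WritingPluginExample
        # tutorial cannot be followed here - get the source package instead
        echo "binary runtime"
    else
        echo "unknown layout"
    fi
}
```

Running `nutch_layout "$NUTCH_HOME"` against an unpacked apache-nutch-1.17-bin.tar.gz would report "binary runtime", which is exactly why the tutorial's $NUTCH_HOME/src/plugin path doesn't exist there.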
Question about Nutch plugins
Hi all,

I've been trying to implement this tutorial https://cwiki.apache.org/confluence/display/nutch/WritingPluginExample on Nutch 1.17. In several places, the tutorial refers to $NUTCH_HOME/src/plugin. However, in my $NUTCH_HOME I only have a "plugin" directory and no src. If I try building the plugin in a subdirectory of the "plugin" directory with ant, I get a problem where build.xml complains that it can't find build-plugin.xml.

Does anyone maybe know what I am doing wrong (in case it helps, I use nutch by just unzipping apache-nutch-1.17-bin.tar.gz)?

Many thanks in advance.

Best regards,
Rastko
Re: Problem with Nutch <-> Eclipse
Hello Sebastian,

each time you helped me I am reminded how grateful I am there are people like you! I am following these instructions:
https://cwiki.apache.org/confluence/display/nutch/RunNutchInEclipse

The most prevalent error is:

  The package org.w3c.dom is accessible from more than one module: <unnamed>, java.xml

This message occurs for a few other packages as well, but I figure if you can help me fix one, I will be more able to fix the others. There is also an error in LuceneAnalyzerUtil.java:

  STOP_WORDS_SET cannot be resolved or is not a field

Enjoy your day and thank you,
...bob

On Tue, Jul 19, 2022 at 2:48 AM Sebastian Nagel wrote:
> [earlier quoted messages trimmed]
Re: Problem with Nutch <-> Eclipse
Hi Bob,

could you share which instructions you followed, and when the error happens - during import, project build, or running/debugging?

The usual way is:

1. to write the Eclipse project configuration, run

   ant eclipse

2. import the written project configuration into Eclipse

Building or running/debugging Nutch in Eclipse is possible, although it requires some work to get everything right.

Best,
Sebastian

On 7/15/22 23:00, Robert Scavilla wrote:
> [earlier quoted messages trimmed]
Problem with Nutch <-> Eclipse
Hello Kind People, I am trying to set up Nutch with Eclipse. I am following the instructions and have an issue that I have not been able to resolve yet. I have the error: "package org.w3c.dom is accessible from more than one module"

There are several modules that get this same error message. The project compiles from the command line without error. It is not clear to me how to resolve this and I hope you can help.

Thank you!
...bob
Re: Does Nutch work with Hadoop Versions greater than 3.1.3?
To add to Sebastian: it runs very well on Hadoop 3.3.x too. Actually, I have never had any Hadoop version that could not run Nutch out of the box and without issues.

On Mon, Jun 13, 2022 at 11:54, Sebastian Nagel wrote:
> [earlier quoted messages trimmed]
Re: Does Nutch work with Hadoop Versions greater than 3.1.3?
Hi Michael,

Nutch (1.18, and trunk/master) should work with more recent Hadoop versions.

At Common Crawl we use a modified Nutch version based on the recent trunk, running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop cluster with x64 and arm64 AWS EC2 instances. But I'm sure there are more possible combinations.

One important note: in trunk/master there is a yet unsolved regression caused by the newly introduced plugin-based URL stream handlers, see NUTCH-2936 and NUTCH-2949. Unless these are resolved, you need to undo these commits in order to run Nutch (built from trunk/master) in distributed mode.

Best,
Sebastian

On 6/13/22 01:37, Michael Coffey wrote:
> [earlier quoted messages trimmed]
Does Nutch work with Hadoop Versions greater than 3.1.3?
Do current 1.x versions of Nutch (1.18, and trunk/master) work with versions of Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from October 2019, and there are many newer versions available. For example, 3.1.4 came out in 2020, and there are 3.2.x and 3.3.x versions that came out this year. I don’t care about newer features in Hadoop, I just have general concerns about stability and security. I am working on reviving an old project and would like to put together the best possible infrastructure for the future.
RE: Nutch not crawling all URLs
Hi, Just continuing this thread: I tried the Selenium plugin as suggested below. I have copied the parameters set for the Selenium plugin from my nutch-site.xml below, with most of the descriptions removed for brevity:

http.agent.name = Esid Crawler
http.agent.email = roselineantai at gmail dot com
http.agent.url = http://esid.shinyapps.io/ESID/
db.ignore.also.redirects = false
db.fetch.interval.default = 30 (The default number of seconds between re-fetches of a page (30 days).)
db.ignore.internal.links = false
db.ignore.external.links = true
parser.skip.truncated = false (Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to extremely high levels of CPU which parsing can sometimes take.)
db.max.outlinks.per.page = -1
http.content.limit = -1
db.ignore.external.links.mode = byHost
db.injector.overwrite = true
http.timeout = 10 (The default network timeout, in milliseconds.)
plugin.includes = protocol-selenium|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier
selenium.driver = chrome
selenium.take.screenshot = false
selenium.screenshot.location =
selenium.hub.port (Selenium Hub Location connection port)
selenium.hub.path = /wd/hub (Selenium Hub Location connection path)
selenium.hub.host = localhost (Selenium Hub Location connection host)
selenium.hub.protocol = http (Selenium Hub Location connection protocol)
selenium.grid.driver = chrome
selenium.grid.binary = /usr/bin/chromedriver
libselenium.page.load.delay = 3
webdriver.chrome.driver = /root/chromedriver (The path to the ChromeDriver binary)
selenium.enable.headless = true (A Boolean value representing the headless option for the Firefox and Chrome drivers)

When I tested the setup using this:

bin/nutch parsechecker \
 -Dplugin.includes='protocol-selenium|parse-tika' \
 -Dselenium.grid.binary=/path/to/selenium/chromedriver \
 -Dselenium.driver=chrome \
 -Dselenium.enable.headless=true \
 -followRedirects -dumpText URL

With some 
of the problematic URLs, they all came out well on the console. There were, however, quite a number of URLs identified as outlinks. But when I run the full crawl with this plugin, it appears to show some data in Solr, but I have been unable to extract any data. It gives '0' as the count of what has been crawled, for all the URLs. This is quite worrying, because without the plugin I did manage to get data from about half of the URLs. The performance is way worse than it should be. I'm also confused because testing some of the sites with the example I was given above works. Below is a sample of the errors I got from the log files. Please have a look at them and let me know if there is a parameter I'm not setting properly:

2022-02-15 01:49:02,093 ERROR tika.TikaParser - Problem loading custom Tika configuration from tika-config.xml java.lang.NumberFormatException: For input string: ""
2022-02-15 13:29:21,331 ERROR selenium.Http - Failed to get protocol output java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown error: net::ERR_NAME_NOT_RESOLVED (Session info: headless chrome=96.0.4664.110)
Caused by: org.openqa.selenium.WebDriverException: unknown error: net::ERR_NAME_NOT_RESOLVED (Session info: headless chrome=96.0.4664.110) *** Element info: {Using=tag name, value=body}
2022-02-15 13:29:23,971 ERROR selenium.Http - Failed to get protocol output java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: no such element: Unable to locate element: {"method":"css selector","selector":"body"} (Session info: headless chrome=96.0.4664.110) For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
2022-02-15 13:29:23,972 INFO fetcher.FetcherThread - FetcherThread 71 fetch of http://ialab.com.ar/ failed with: java.lang.RuntimeException: org.openqa.selenium.NoSuchEle> (Session info: headless chrome=96.0.4664.110) For documentation on this error, please visit: http://seleniumhq.org/exceptions/no_such_element.html
2022-02-15 13:29:27,648 ERROR selenium.HttpWebClient - Selenium WebDriver: Timeout Exception: Capturing whatever loaded so far...
2022-02-15 13:32:42,713 INFO regex.RegexURLNormalizer - can't find rules for scope 'outlink', using default
2022-02-15 13:33:23,664 INFO regex.RegexURLNormalizer - can't find rules for scope 'fetcher', using default
2022-02-15 13:36:23,347 INFO parse.ParserFactory - The parsing plugins: [org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes system property, and >2022-02-15 13:36:23,
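[Editor's note: the net::ERR_NAME_NOT_RESOLVED errors in the log above are DNS resolution failures, not a Nutch or Selenium misconfiguration. As a hypothetical pre-flight check (not part of Nutch), seed hosts can be screened before a crawl with a short Python sketch:]

```python
import socket
from urllib.parse import urlparse

def resolvable(url):
    """Return True if the URL's hostname resolves in DNS from this machine."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        socket.gethostbyname(host)
        return True
    except socket.gaierror:
        # Same failure mode the browser reports as ERR_NAME_NOT_RESOLVED.
        return False

# The .invalid TLD is reserved (RFC 2606) and never resolves.
for url in ["http://localhost/", "http://no-such-host.invalid/"]:
    print(url, resolvable(url))
```

Seeds whose hosts do not resolve from the crawl machine will always fail this way, regardless of plugin configuration.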
RE: Nutch not crawling all URLs
Thank you Sebastian. I will try these. Kind regards, Roseline -----Original Message----- From: Sebastian Nagel Sent: 13 January 2022 12:33 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs Hi Roseline, > Does it work at all with Chrome? Yes. > It seems you need to have some form of GUI to run it? You need graphics libraries but not necessarily a graphical system. Normally, you run the browser in headless mode without a graphical device (monitor) attached. > Is there some documentation or tutorial on this? The README is probably the best documentation: src/plugin/protocol-selenium/README.md https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium After installing chromium and the Selenium chromedriver, you can test whether it works by running: bin/nutch parsechecker \ -Dplugin.includes='protocol-selenium|parse-tika' \ -Dselenium.grid.binary=/path/to/selenium/chromedriver \ -Dselenium.driver=chrome \ -Dselenium.enable.headless=true \ -followRedirects -dumpText URL Caveat: because browsers are updated frequently, you may need to use a recent driver version and eventually also upgrade the Selenium dependencies in Nutch. Let us know if you need help here. > My use case is Text mining and Machine Learning classification. I'm > indexing into Solr and then transferring the indexed data to MongoDB > for further processing. Well, that's not an untypical use case for Nutch. And it's a long pipeline: fetching, HTML parsing, extracting content fields, indexing. Nutch is able to perform all steps. But I'd agree that browser-based crawling isn't that easy to set up with Nutch. 
Best, Sebastian On 1/12/22 17:53, Roseline Antai wrote: > Hi Sebastian, > > Thank you. I did enjoy the holiday. Hope you did too. > > I have had a look at the protocol-selenium plugin, but it was a bit difficult > to understand. It appears it only works with Firefox. Does it work at all > with Chrome? I was also not sure of what values to set for the properties. It > seems you need to have some form of GUI to run it? > > Is there some documentation or tutorial on this? My guess is that some of the > pages might not be crawling because of JavaScript. I might be wrong, but > would want to test that. > > I think it would be quite good for my use case because I am trying to implement > broad crawling. > > My use case is Text mining and Machine Learning classification. I'm indexing > into Solr and then transferring the indexed data to MongoDB for further > processing. > > Kind regards, > Roseline > > -----Original Message----- > From: Sebastian Nagel > Sent: 12 January 2022 16:12 > To: user@nutch.apache.org > Subject: Re: Nutch not crawling all URLs > > Hi Roseline, > >> the mail below went to my junk folder and I didn't see it. > > No problem. I hope you nevertheless enjoyed the holidays. > And sorry for any delays but I want to emphasize that Nutch is a community > project and in doubt it might take a few days until somebody finds the time > to respond. > >> Could you confirm if you received all the urls I sent? > > I've tried a few URLs you sent but not all of them. And to figure out the > reason why a site isn't crawled may take some time. > >> Another question I have about Nutch is if it has problems with >> crawling javascript pages? > > By default Nutch does not execute Javascript. > > There is a protocol plugin (protocol-selenium) to fetch pages with a web > browser between Nutch and the crawled sites. 
This way Javascript pages can be > crawled for the price of some overhead in setting up the crawler and network > traffic to fetch the page dependencies (CSS, Javascript, images). > >> I would ideally love to make the crawler work for my URLs than start >> checking for other crawlers and waste all the work so far. > > Well, Nutch is for sure a good crawler. But as always: there are many other > crawlers which might be better adapted to a specific use case. > > What's your use case? Indexing into Solr or Elasticsearch? > Text mining? Archiving content? > > Best, > Sebastian > > On 1/12/22 12:13, Roseline Antai wrote: >> Hi Sebastian, >> >> For some reason, the mail below went to my junk folder and I didn't see it. >> >> The notco page - >> https://notco.com/
Re: Nutch not crawling all URLs
Hi Roseline, > Does it work at all with Chrome? Yes. > It seems you need to have some form of GUI to run it? You need graphics libraries but not necessarily a graphical system. Normally, you run the browser in headless mode without a graphical device (monitor) attached. > Is there some documentation or tutorial on this? The README is probably the best documentation: src/plugin/protocol-selenium/README.md https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium After installing chromium and the Selenium chromedriver, you can test whether it works by running: bin/nutch parsechecker \ -Dplugin.includes='protocol-selenium|parse-tika' \ -Dselenium.grid.binary=/path/to/selenium/chromedriver \ -Dselenium.driver=chrome \ -Dselenium.enable.headless=true \ -followRedirects -dumpText URL Caveat: because browsers are updated frequently, you may need to use a recent driver version and eventually also upgrade the Selenium dependencies in Nutch. Let us know if you need help here. > My use case is Text mining and Machine Learning classification. I'm indexing > into Solr and then transferring the indexed data to MongoDB for further > processing. Well, that's not an untypical use case for Nutch. And it's a long pipeline: fetching, HTML parsing, extracting content fields, indexing. Nutch is able to perform all steps. But I'd agree that browser-based crawling isn't that easy to set up with Nutch. Best, Sebastian On 1/12/22 17:53, Roseline Antai wrote: > Hi Sebastian, > > Thank you. I did enjoy the holiday. Hope you did too. > > I have had a look at the protocol-selenium plugin, but it was a bit difficult > to understand. It appears it only works with Firefox. Does it work at all > with Chrome? I was also not sure of what values to set for the properties. It > seems you need to have some form of GUI to run it? > > Is there some documentation or tutorial on this? My guess is that some of the > pages might not be crawling because of JavaScript. 
I might be wrong, but > would want to test that. > > I think it would be quite good for my use case because I am trying to implement > broad crawling. > > My use case is Text mining and Machine Learning classification. I'm indexing > into Solr and then transferring the indexed data to MongoDB for further > processing. > > Kind regards, > Roseline > > -----Original Message----- > From: Sebastian Nagel > Sent: 12 January 2022 16:12 > To: user@nutch.apache.org > Subject: Re: Nutch not crawling all URLs > > Hi Roseline, > >> the mail below went to my junk folder and I didn't see it. > > No problem. I hope you nevertheless enjoyed the holidays. > And sorry for any delays but I want to emphasize that Nutch is a community > project and in doubt it might take a few days until somebody finds the time > to respond. > >> Could you confirm if you received all the urls I sent? > > I've tried a few URLs you sent but not all of them. And to figure out the > reason why a site isn't crawled may take some time. > >> Another question I have about Nutch is if it has problems with >> crawling javascript pages? > > By default Nutch does not execute Javascript. > > There is a protocol plugin (protocol-selenium) to fetch pages with a web > browser between Nutch and the crawled sites. This way Javascript pages can be > crawled for the price of some overhead in setting up the crawler and network > traffic to fetch the page dependencies (CSS, Javascript, images). > >> I would ideally love to make the crawler work for my URLs than start >> checking for other crawlers and waste all the work so far. > > Well, Nutch is for sure a good crawler. But as always: there are many other > crawlers which might be better adapted to a specific use case. > > What's your use case? Indexing into Solr or Elasticsearch? > Text mining? Archiving content? 
> > Best, > Sebastian > > On 1/12/22 12:13, Roseline Antai wrote: >> Hi Sebastian, >> >> For some reason, the mail below went to my junk folder and I didn't see it. >> >> The notco page - >> https://notco.com/ >> was not indexed, no. When I enabled redirects, I was able to get a few >> pages, but they don't seem valid. >> >> Could you confirm if you received all the urls I sent
RE: Nutch not crawling all URLs
Hi Sebastian, Thank you. I did enjoy the holiday. Hope you did too. I have had a look at the protocol-selenium plugin, but it was a bit difficult to understand. It appears it only works with Firefox. Does it work at all with Chrome? I was also not sure of what values to set for the properties. It seems you need to have some form of GUI to run it? Is there some documentation or tutorial on this? My guess is that some of the pages might not be crawling because of JavaScript. I might be wrong, but would want to test that. I think it would be quite good for my use case because I am trying to implement broad crawling. My use case is Text mining and Machine Learning classification. I'm indexing into Solr and then transferring the indexed data to MongoDB for further processing. Kind regards, Roseline -----Original Message----- From: Sebastian Nagel Sent: 12 January 2022 16:12 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs Hi Roseline, > the mail below went to my junk folder and I didn't see it. No problem. I hope you nevertheless enjoyed the holidays. And sorry for any delays but I want to emphasize that Nutch is a community project and in doubt it might take a few days until somebody finds the time to respond. > Could you confirm if you received all the urls I sent? I've tried a few URLs you sent but not all of them. And to figure out the reason why a site isn't crawled may take some time. > Another question I have about Nutch is if it has problems with > crawling javascript pages? By default Nutch does not execute Javascript. There is a protocol plugin (protocol-selenium) to fetch pages with a web browser between Nutch and the crawled sites. This way Javascript pages can be crawled for the price of some overhead in setting up the crawler and network traffic to fetch the page dependencies (CSS, Javascript, images). > I would ideally love to make the crawler work for my URLs than start > checking for other crawlers and waste all the work so far. 
Well, Nutch is for sure a good crawler. But as always: there are many other crawlers which might be better adapted to a specific use case. What's your use case? Indexing into Solr or Elasticsearch? Text mining? Archiving content? Best, Sebastian On 1/12/22 12:13, Roseline Antai wrote: > Hi Sebastian, > > For some reason, the mail below went to my junk folder and I didn't see it. > > The notco page - > https://notco.com/ > was not indexed, no. When I enabled redirects, I was able to get a few > pages, but they don't seem valid. > > Could you confirm if you received all the urls I sent? > > Another question I have about Nutch is if it has problems with crawling > javascript pages? > > I would ideally love to make the crawler work for my URLs than start checking > for other crawlers and waste all the work so far. > > Just adding again, this is what my nutch-site.xml looks like: > > http.agent.name > Nutch Crawler > > http.agent.email > datalake.ng at gmail d > > db.ignore.internal.links > false > > db.ignore.external.links > true > > plugin.includes > protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier > > parser.skip.truncated > false > Boolean value for whether we should skip parsing for > truncated documents. By default this > property is activated due to extremely high levels of CPU which > parsing can sometimes take. > > db.max.outlinks.per.page > -1 > The maximum number of outlinks that we'll process for a page. 
>If this value is nonnegative (>=0), at most db.max.outlinks.per.page > outlinks >will be processed for a page; otherwise, all outlinks will be processed. > > > > http.content.limit > -1 > The length limit for downloaded content using the http:// > protocol, in bytes. If this value is nonnegative (>=0), content longer > than it will be truncated; otherwise, no truncation at all. Do not > confuse this setting with the file.content.limit setting. > > > > db.ignore.external.links.mode > byHost > > > db.injector.overwrite > true > > > http.timeout > 5 > The default network time
Re: Nutch not crawling all URLs
Hi Roseline, > the mail below went to my junk folder and I didn't see it. No problem. I hope you nevertheless enjoyed the holidays. And sorry for any delays but I want to emphasize that Nutch is a community project and in doubt it might take a few days until somebody finds the time to respond. > Could you confirm if you received all the urls I sent? I've tried a few URLs you sent but not all of them. And to figure out the reason why a site isn't crawled may take some time. > Another question I have about Nutch is if it has problems with crawling > javascript pages? By default Nutch does not execute Javascript. There is a protocol plugin (protocol-selenium) to fetch pages with a web browser between Nutch and the crawled sites. This way Javascript pages can be crawled for the price of some overhead in setting up the crawler and network traffic to fetch the page dependencies (CSS, Javascript, images). > I would ideally love to make the crawler work for my URLs than start checking > for other crawlers and waste all the work so far. Well, Nutch is for sure a good crawler. But as always: there are many other crawlers which might be better adapted to a specific use case. What's your use case? Indexing into Solr or Elasticsearch? Text mining? Archiving content? Best, Sebastian On 1/12/22 12:13, Roseline Antai wrote: > Hi Sebastian, > > For some reason, the mail below went to my junk folder and I didn't see it. > > The notco page - https://notco.com/ was not indexed, no. When I enabled > redirects, I was able to get a few pages, but they don't seem valid. > > Could you confirm if you received all the urls I sent? > > Another question I have about Nutch is if it has problems with crawling > javascript pages? > > I would ideally love to make the crawler work for my URLs than start checking > for other crawlers and waste all the work so far. 
> > Just adding again, this is what my nutch-site.xml looks like: > > > > > > > > http.agent.name > Nutch Crawler > > > http.agent.email > datalake.ng at gmail d > > > db.ignore.internal.links > false > > > db.ignore.external.links > true > > > plugin.includes > > protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier > > > parser.skip.truncated > false > Boolean value for whether we should skip parsing for > truncated documents. By default this > property is activated due to extremely high levels of CPU which > parsing can sometimes take. > > > >db.max.outlinks.per.page >-1 >The maximum number of outlinks that we'll process for a page. >If this value is nonnegative (>=0), at most db.max.outlinks.per.page > outlinks >will be processed for a page; otherwise, all outlinks will be processed. > > > > http.content.limit > -1 > The length limit for downloaded content using the http:// > protocol, in bytes. If this value is nonnegative (>=0), content longer > than it will be truncated; otherwise, no truncation at all. Do not > confuse this setting with the file.content.limit setting. > > > > db.ignore.external.links.mode > byHost > > > db.injector.overwrite > true > > > http.timeout > 5 > The default network timeout, in milliseconds. > > > > Regards, > Roseline > > -Original Message- > From: Sebastian Nagel > Sent: 13 December 2021 17:35 > To: user@nutch.apache.org > Subject: Re: Nutch not crawling all URLs > > CAUTION: This email originated outside the University. Check before clicking > links or attachments. 
> > Hi Roseline, > >> 5,36405,0,http://www.notco.com/ > > What is the status for https://notco.com/ which is the final redirect > target? > Is the target page indexed? > > ~Sebastian >
RE: Nutch not crawling all URLs
Hi Sebastian, For some reason, the mail below went to my junk folder and I didn't see it. The notco page - https://notco.com/ was not indexed, no. When I enabled redirects, I was able to get a few pages, but they don't seem valid. Could you confirm if you received all the urls I sent? Another question I have about Nutch is if it has problems with crawling javascript pages? I would ideally love to make the crawler work for my URLs than start checking for other crawlers and waste all the work so far. Just adding again, this is what my nutch-site.xml looks like:

http.agent.name = Nutch Crawler
http.agent.email = datalake.ng at gmail d
db.ignore.internal.links = false
db.ignore.external.links = true
plugin.includes = protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier
parser.skip.truncated = false (Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to extremely high levels of CPU which parsing can sometimes take.)
db.max.outlinks.per.page = -1 (The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.)
http.content.limit = -1 (The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.)
db.ignore.external.links.mode = byHost
db.injector.overwrite = true
http.timeout = 5 (The default network timeout, in milliseconds.)

Regards, Roseline -----Original Message----- From: Sebastian Nagel Sent: 13 December 2021 17:35 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs CAUTION: This email originated outside the University. Check before clicking links or attachments. 
Hi Roseline, > 5,36405,0,http://www.notco.com/ What is the status for https://notco.com/ which is the final redirect target? Is the target page indexed? ~Sebastian
!! Join the #nutch Slack channel !!
Hi user@, dev@, I took the liberty of setting up a #nutch channel for our community to communicate in a lower latency manner. First join the-asf.slack.com Slack workspace https://infra.apache.org/slack.html Then simply join the #nutch channel. See you there :) Thanks lewismc -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
RE: Nutch not crawling all URLs
Hi, Following on from my previous enquiry, I was told to send the URLs I was trying to crawl so they could be tried from your end. I sent these, but did not receive any confirmation of receipt. Can you please confirm whether they have been received, and when I can expect some feedback? I re-crawled the 20 URLs again and reset these values to the defaults from the nutch-default.xml file:

fetcher.server.delay = 65.0
fetcher.server.min.delay = 25.0
fetcher.max.crawl.delay = 70

I then set ignore external links to false, as below:

db.ignore.external.links = false

I still left the following property set to 'true':

db.ignore.also.redirects = true (If true, the fetcher checks redirects the same way as links when ignoring internal or external links. Set to false to follow redirects despite the values for db.ignore.external.links and db.ignore.internal.links.)

13 URLs were fetched, but of these, the URLs that were originally not fetched returned very few pages related to the domain in the URL, and this makes me question the crawl. Also, when external links are not ignored, the crawler does go off onto different sites, like Wikipedia, news sites, etc. This is hardly efficient, as it spends so long on the crawl fetching irrelevant pages. How can this be controlled in Nutch? If crawling up to 900 URLs, as we are going to be doing, will we have to write regex expressions for each URL in the regex-urlfilter in order to stick to the domains in the URLs? There is no explicit documentation on how to do this in Nutch, unless I have missed it? Is there something that should be done that I'm not doing, or is Nutch just incapable of efficient crawling? Regards, Roseline Dr Roseline Antai Research Fellow Hunter Centre for Entrepreneurship Strathclyde Business School University of Strathclyde, Glasgow, UK The University of Strathclyde is a charitable body, registered in Scotland, number SC015263. 
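[Editor's note: on the question of restricting a crawl of many seeds to their own domains, the accept rules for regex-urlfilter.txt can be generated from the seed list rather than written by hand. A hypothetical Python sketch (not a tool shipped with Nutch; the rule shape is the common `+^https?://host/` accept pattern):]

```python
import re
from urllib.parse import urlparse

def accept_rules(seed_urls):
    """Generate one regex-urlfilter accept rule per distinct seed host."""
    hosts = {urlparse(u).hostname for u in seed_urls if urlparse(u).hostname}
    # Accept http or https on exactly these hosts; dots must be escaped.
    return ["+^https?://%s/" % re.escape(h) for h in sorted(hosts)]

seeds = ["http://www.notco.com", "http://ialab.com.ar", "http://cedo.org"]
for rule in accept_rules(seeds):
    print(rule)
```

The generated rules would go above the final catch-all rule in regex-urlfilter.txt, since the filter applies the first rule that matches each URL.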
From: Roseline Antai Sent: 13 December 2021 12:02 To: 'user@nutch.apache.org' Subject: Nutch not crawling all URLs Hi, I am working with Apache Nutch 1.18 and Solr. I have set up the system successfully, but I'm now having the problem that Nutch is refusing to crawl all the URLs. I am now at a loss as to what I should do to correct this problem. It fetches about half of the URLs in the seed.txt file. For instance, when I inject 20 URLs, only 9 are fetched. I have made a number of changes based on the suggestions I saw on the Nutch forum, as well as on Stack Overflow, but nothing seems to work. This is what my nutch-site.xml file looks like:

http.agent.name = Nutch Crawler
http.agent.email = datalake.ng at gmail d
db.ignore.internal.links = false
db.ignore.external.links = true
plugin.includes = protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier
parser.skip.truncated = false (Boolean value for whether we should skip parsing for truncated documents. By default this property is activated due to extremely high levels of CPU which parsing can sometimes take.)
db.max.outlinks.per.page = -1 (The maximum number of outlinks that we'll process for a page. If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks will be processed for a page; otherwise, all outlinks will be processed.)
http.content.limit = -1 (The length limit for downloaded content using the http:// protocol, in bytes. If this value is nonnegative (>=0), content longer than it will be truncated; otherwise, no truncation at all. Do not confuse this setting with the file.content.limit setting.)
db.ignore.external.links.mode = byDomain
db.injector.overwrite = true
http.timeout = 5 (The default network timeout, in milliseconds.)

Other changes I have made include changing the following in nutch-default.xml:

http.redirect.max = 2 (The maximum number of redirects the fetcher will follow when trying to fetch a page. If set to negative or 0, the fetcher won't immediately follow redirected URLs; instead it will record them for later fetching.)
ftp.timeout = 10
ftp.server.timeout = 15
fetcher.server.delay = 65.0
fetcher.server.min.delay = 25.0
fetcher.max.crawl.delay = 70

I also commented out the line below in the regex-urlfilter file:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Nothing seems to work. What is it that I'm not doing, or doing wrongly here? Regards, Roseline Dr Roseline Antai Research
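[Editor's note: the db.ignore.external.links.mode setting quoted above controls how "external" is decided. A rough illustration of the difference between byHost and byDomain, using a naive last-two-labels heuristic for the registered domain (Nutch's own logic handles public suffixes such as .com.ar more carefully):]

```python
from urllib.parse import urlparse

def host(url):
    return urlparse(url).hostname or ""

def domain(url):
    # Naive heuristic: take the last two labels of the hostname.
    # Real suffix handling (e.g. .ac.uk, .com.ar) needs a public-suffix list.
    return ".".join(host(url).split(".")[-2:])

seed = "http://www.notco.com"
link = "https://notco.com/products"

print(host(seed) == host(link))      # byHost view: different hosts -> external
print(domain(seed) == domain(link))  # byDomain view: same domain -> internal
```

With db.ignore.external.links = true, byHost would drop the www-less link above while byDomain would keep it, which can explain "missing" pages when sites redirect between www and bare hostnames.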
Re: Nutch not crawling all URLs
Hi Sebastian, yes, that is what I mean. Do you think there is a way to learn more about how to crawl any website? >Hi Ayhan, >you mean? >https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt
Re: Nutch not crawling all URLs
Hi Ayhan, you mean? https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt Sebastian On 12/13/21 20:59, Ayhan Koyun wrote: > Hi, > > as I wrote before, it seems that I am not the only one who can not crawl all > the seed.txt url's. I couldn't > find a solution really. I collected 450 domains and approximately 200 nutch > will or can not crawl. I want to > know why this happens, is there a solution to force crawling sites? > > It would be great to get a satisfying answer, to know why this happens and > maybe how to solve it. > > Thanks in advance > > Ayhan > >
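[Editor's note: the Stack Overflow question referenced above concerns robots.txt. When seeds silently fail to fetch, robots.txt is a common cause even when it looks permissive, because a group for a specific agent overrides the `*` group. Python's standard library can show how a given robots.txt treats a particular agent string; the rules below are hypothetical and the check is independent of Nutch:]

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: permissive for everyone except one named agent.
robots_txt = """User-agent: *
Disallow:

User-agent: BadBot
Disallow: /
""".splitlines()

rp = RobotFileParser()
rp.parse(robots_txt)

print(rp.can_fetch("SomeCrawler", "http://example.com/page"))  # True
print(rp.can_fetch("BadBot", "http://example.com/page"))       # False
```

Running the same check with the value of http.agent.name against a site's real robots.txt (fetched separately) can quickly confirm or rule out robots exclusion as the reason a seed is skipped.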
RE: Nutch not crawling all URLs
Hi Lewis, I got a really weird reply back from what I sent, so I thought it better to resend the URLs again. I'm unsure if you got the URLs in the first instance. I've sent them as a text file attachment as well. http://traivefinance.com http://www.ceibal.edu.uy http://www.talovstudio.com https://portaltelemedicina.com.br/en/telediagnostic-platform http://www.notco.com http://www.saiph.org http://www.1doc3.com http://www.amanda-care.com http://www.unimadx.com http://www.upch.edu.pe/bioinformatic/anemia/app/ http://www.u-planner.com http://alerce.science http://paraempleo.mtess.gov.py http://layers.hemav.com http://www.sisben.gov.co http://ialab.com.ar http://www.kilimo.com.ar https://www.facebook.com/CIRSYS http://www.dymaxionlabs.com http://cedo.org Regards, Roseline Dr Roseline Antai Research Fellow Hunter Centre for Entrepreneurship Strathclyde Business School University of Strathclyde, Glasgow, UK The University of Strathclyde is a charitable body, registered in Scotland, number SC015263. -Original Message- From: lewis john mcgibbney Sent: 13 December 2021 17:18 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs CAUTION: This email originated outside the University. Check before clicking links or attachments. Hi Roseline, Looks like you are ignoring external URLs... that could be the problem right there. I encourage you to track counters on inject, generate and fetch phases to understand where records may be being dropped. Are the seeds you are using public? If so please post your seed file so we can try. 
Thank you lewismc On Mon, Dec 13, 2021 at 04:02 wrote: > > user Digest 13 Dec 2021 12:02:41 - Issue 3132 > > Topics (messages 34682 through 34682) > > Nutch not crawling all URLs > 34682 by: Roseline Antai > > Administrivia: > > - > To post to the list, e-mail: user@nutch.apache.org To unsubscribe, > e-mail: user-digest-unsubscr...@nutch.apache.org > For additional commands, e-mail: user-digest-h...@nutch.apache.org > > -- > > > > > -- Forwarded message -- > From: Roseline Antai > To: "user@nutch.apache.org" > Cc: > Bcc: > Date: Mon, 13 Dec 2021 12:02:26 +0000 > Subject: Nutch not crawling all URLs > > Hi, > > > > I am working with Apache nutch 1.18 and Solr. I have set up the system > successfully, but I'm now having the problem that Nutch is refusing to > crawl all the URLs. I am now at a loss as to what I should do to > correct this problem. It fetches about half of the URLs in the seed.txt file. > > > > For instance, when I inject 20 URLs, only 9 are fetched. I have made a > number of changes based on the suggestions I saw on the Nutch forum, > as well as on Stack overflow, but nothing seems to work. 
> > This is what my nutch-site.xml file looks like: [...]
Re: Nutch not crawling all URLs
Hi, as I wrote before, it seems that I am not the only one who cannot crawl all the URLs in seed.txt. I couldn't really find a solution. I collected 450 domains, and Nutch will not or cannot crawl approximately 200 of them. I want to know why this happens; is there a solution to force crawling these sites? It would be great to get a satisfying answer, to know why this happens and maybe how to solve it. Thanks in advance Ayhan
Re: Nutch not crawling all URLs
Hi Roseline, > 5,36405,0,http://www.notco.com What is the status for https://notco.com/, which is the final redirect target? Is the target page indexed? ~Sebastian
RE: Nutch not crawling all URLs
Hi Lewis,

Yes, they are public websites. Below are the 20 test URLs I've been trying to crawl.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org

This is a count of the pages for the URLs crawled and not crawled. As can be seen, some are very large, while some are '0'.

,Project_id,Document Length,url
0,36400,0,http://www.trapview.com/v2/en/
1,36401,0,http://traivefinance.com
2,36402,2344075,http://www.ceibal.edu.uy
3,36403,35072,http://www.talovstudio.com
4,36404,1384658,https://portaltelemedicina.com.br/en/telediagnostic-platform
5,36405,0,http://www.notco.com
6,36406,0,http://www.saiph.org
7,36407,246009,http://www.1doc3.com
8,36408,43190,http://www.amanda-care.com
9,36409,0,http://www.unimadx.com
10,36410,0,http://www.upch.edu.pe/bioinformatic/anemia/app/
11,36411,0,http://www.u-planner.com
12,36412,8084,http://alerce.science
13,36413,0,http://paraempleo.mtess.gov.py
14,36414,0,http://layers.hemav.com
15,36415,0,http://www.sisben.gov.co
16,36416,3794113,http://ialab.com.ar
17,36417,0,http://www.kilimo.com.ar
18,36418,0,https://www.facebook.com/CIRSYS
19,36419,49062,http://www.dymaxionlabs.com
20,36420,1281267,http://cedo.org

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in Scotland, number SC015263.
-Original Message- From: lewis john mcgibbney Sent: 13 December 2021 17:18 To: user@nutch.apache.org Subject: Re: Nutch not crawling all URLs CAUTION: This email originated outside the University. Check before clicking links or attachments. Hi Roseline, Looks like you are ignoring external URLs... that could be the problem right there. I encourage you to track counters on inject, generate and fetch phases to understand where records may be being dropped. Are the seeds you are using public? If so please post your seed file so we can try. Thank you lewismc On Mon, Dec 13, 2021 at 04:02 wrote: > > user Digest 13 Dec 2021 12:02:41 - Issue 3132 > > Topics (messages 34682 through 34682) > > Nutch not crawling all URLs > 34682 by: Roseline Antai > > Administrivia: > > - > To post to the list, e-mail: user@nutch.apache.org To unsubscribe, > e-mail: user-digest-unsubscr...@nutch.apache.org > For additional commands, e-mail: user-digest-h...@nutch.apache.org > > -- > > > > > -- Forwarded message -- > From: Roseline Antai > To: "user@nutch.apache.org" > Cc: > Bcc: > Date: Mon, 13 Dec 2021 12:02:26 +0000 > Subject: Nutch not crawling all URLs > > Hi, > > > > I am working with Apache nutch 1.18 and Solr. I have set up the system > successfully, but I'm now having the problem that Nutch is refusing to > crawl all the URLs. I am now at a loss as to what I should do to > correct this problem. It fetches about half of the URLs in the seed.txt file. > > > > For instance, when I inject 20 URLs, only 9 are fetched. I have made a > number of changes based on the suggestions I saw on the Nutch forum, > as well as on Stack overflow, but nothing seems to work. 
> > This is what my nutch-site.xml file looks like: [...]
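The page-length counts above can be tallied with a short awk script to pull out exactly which seeds came back empty (the CSV rows are copied from the message; the /tmp path is illustrative):

```shell
# Page-length CSV from the message above: index,project_id,bytes,url
cat > /tmp/lengths.csv <<'EOF'
0,36400,0,http://www.trapview.com/v2/en/
1,36401,0,http://traivefinance.com
2,36402,2344075,http://www.ceibal.edu.uy
3,36403,35072,http://www.talovstudio.com
4,36404,1384658,https://portaltelemedicina.com.br/en/telediagnostic-platform
5,36405,0,http://www.notco.com
6,36406,0,http://www.saiph.org
7,36407,246009,http://www.1doc3.com
8,36408,43190,http://www.amanda-care.com
9,36409,0,http://www.unimadx.com
10,36410,0,http://www.upch.edu.pe/bioinformatic/anemia/app/
11,36411,0,http://www.u-planner.com
12,36412,8084,http://alerce.science
13,36413,0,http://paraempleo.mtess.gov.py
14,36414,0,http://layers.hemav.com
15,36415,0,http://www.sisben.gov.co
16,36416,3794113,http://ialab.com.ar
17,36417,0,http://www.kilimo.com.ar
18,36418,0,https://www.facebook.com/CIRSYS
19,36419,49062,http://www.dymaxionlabs.com
20,36420,1281267,http://cedo.org
EOF
# Print every seed that yielded zero bytes, then a summary line.
# Last line printed: "12 of 21 seeds fetched 0 bytes"
awk -F, '$3 == 0 {n++; print $4} END {print n " of " NR " seeds fetched 0 bytes"}' /tmp/lengths.csv
```

Each URL printed is a candidate for a follow-up check with `bin/nutch parsechecker -followRedirects -checkRobotsTxt <url>`, as suggested elsewhere in this thread.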
Re: Nutch not crawling all URLs
Hi Roseline, Looks like you are ignoring external URLs… that could be the problem right there. I encourage you to track counters on inject, generate and fetch phases to understand where records may be being dropped. Are the seeds you are using public? If so please post your seed file so we can try. Thank you lewismc On Mon, Dec 13, 2021 at 04:02 wrote: > > user Digest 13 Dec 2021 12:02:41 - Issue 3132 > > Topics (messages 34682 through 34682) > > Nutch not crawling all URLs > 34682 by: Roseline Antai > > Administrivia: > > - > To post to the list, e-mail: user@nutch.apache.org > To unsubscribe, e-mail: user-digest-unsubscr...@nutch.apache.org > For additional commands, e-mail: user-digest-h...@nutch.apache.org > > -- > > > > > -- Forwarded message -- > From: Roseline Antai > To: "user@nutch.apache.org" > Cc: > Bcc: > Date: Mon, 13 Dec 2021 12:02:26 +0000 > Subject: Nutch not crawling all URLs > > Hi, > > > > I am working with Apache nutch 1.18 and Solr. I have set up the system > successfully, but I’m now having the problem that Nutch is refusing to > crawl all the URLs. I am now at a loss as to what I should do to correct > this problem. It fetches about half of the URLs in the seed.txt file. > > > > For instance, when I inject 20 URLs, only 9 are fetched. I have made a > number of changes based on the suggestions I saw on the Nutch forum, as > well as on Stack overflow, but nothing seems to work. 
> This is what my nutch-site.xml file looks like: [...]
>
> Other changes I have made include changing the following in
> nutch-default.xml: [...]
>
> I also commented out the line below in the regex-urlfilter file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
> Nothing seems to work.
>
> What is it that I’m not doing, or doing wrongly here?
>
> Regards,
>
> Roseline
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.

-- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
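Tracking the counters Lewis mentions mostly means grepping hadoop.log after each phase. A sketch with a hypothetical log excerpt (the "Total new urls injected" and "FetcherThread ... fetching" message formats are the ones cited elsewhere in this thread; real logs live in $NUTCH_HOME/logs/hadoop.log):

```shell
# Hypothetical hadoop.log excerpt, written to /tmp for illustration.
cat > /tmp/hadoop.log <<'EOF'
Injector: Total new urls injected: 20
FetcherThread 41 fetching http://traivefinance.com/
FetcherThread 42 fetching http://www.ceibal.edu.uy/
EOF
# Compare how many URLs were injected vs. how many the fetcher attempted:
grep -c 'Total new urls injected' /tmp/hadoop.log   # prints 1 (matching lines)
grep -c 'FetcherThread.*fetching' /tmp/hadoop.log   # prints 2
```

A large gap between the injected count and the number of fetch attempts points at generation/filtering; failures after "fetching" lines point at robots.txt, timeouts, or HTTP errors.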
Re: Nutch not crawling all URLs
Hi,

(looping back to user@nutch - sorry, pressed the wrong reply button)

> Some URLs were denied by robots.txt,
> while a few failed with: Http code=403

Those are two ways of signaling that these pages shouldn't be crawled; HTTP 403 means "Forbidden".

> 3. I looked in CrawlDB and most URLs are in there, but were not
> crawled, so this is something that I find very confusing.

The CrawlDb also contains URLs which failed for various reasons. That's important in order to avoid retrying 404s, 403s, etc. again and again.

> I also ran some of the URLs that were not crawled through this -
> bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.

The "HTTP 403 Forbidden" could come from "anti-bot protection" software. If you run parsechecker at a different time or from a different machine, and not repeatedly or too often, it may succeed.

Best,
Sebastian

On 12/13/21 17:48, Roseline Antai wrote: > Hi Sebastian, > > Thank you for your reply. > > 1. All URLs were injected, so 20 in total. None was rejected. > > 2. I've had a look at the log files and I can see that some of the URLs could > not be fetched because the robots.txt file could not be found. Would this be a > reason for why the fetch failed? Is there a way to go around it? > > Some URLs were denied by robots.txt, while a few failed with: Http code=403 > > 3. I looked in CrawlDB and most URLs are in there, but were not crawled, so > this is something that I find very confusing. > > I also ran some of the URLs that were not crawled through this - bin/nutch > parsechecker -followRedirects -checkRobotsTxt https://myUrl > > Some of the URLs that failed were parsed successfully, so I'm really confused > as to why there are no results for them. > > Do you have any suggestions on what I should try?
> > Dr Roseline Antai > Research Fellow > Hunter Centre for Entrepreneurship > Strathclyde Business School > University of Strathclyde, Glasgow, UK > > > The University of Strathclyde is a charitable body, registered in Scotland, > number SC015263. > > > -Original Message- > From: Sebastian Nagel > Sent: 13 December 2021 12:19 > To: Roseline Antai > Subject: Re: Nutch not crawling all URLs > > CAUTION: This email originated outside the University. Check before clicking > links or attachments. > > Hi Roseline, > >> For instance, when I inject 20 URLs, only 9 are fetched. > > Are there any log messages about the 11 unfetched URLs in the log files. Try > to look for a file "hadoop.log" > (usually in $NUTCH_HOME/logs/) and look > 1. how many URLs have been injected. > There should be a log message > ... Total new urls injected: ... > 2. If all 20 URLs are injected, there should be log > messages about these URLs from the fetcher: > FetcherThread ... fetching ... > If the fetch fails, there might be a message about > this. > 3. Look into the CrawlDb for the missing URLs. > bin/nutch readdb .../crawldb -url > or > bin/nutch readdb .../crawldb -dump ... > You get the command-line options by calling > bin/nutch readdb > without any arguments > > Alternatively, verify fetching and parsing the URLs by > bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl > > >> >> db.ignore.external.links >> true >> > > Eventually, you want to follow redirects anyway? See > > > db.ignore.also.redirects > true > If true, the fetcher checks redirects the same way as > links when ignoring internal or external links. Set to false to > follow redirects despite the values for db.ignore.external.links and > db.ignore.internal.links. > > > > Best, > Sebastian > > > On 12/13/21 13:02, Roseline Antai wrote: >> Hi, >> >> >> >> I am working with Apache nutch 1.18 and Solr. 
I have set up the system >> successfully, but I’m now having the problem that Nutch is refusing to >> crawl all the URLs. I am now at a loss as to what I should do to >> correct this problem. It fetches about half of the URLs in the seed.txt file. >> >> For instance, when I inject 20 URLs, only 9 are fetched. I have made a >> number of changes based on the suggestions I saw on the Nutch forum, >> as well as on Stack overflow, but nothing seems to work. >> >> This is what my nutch-site.xml file looks like: [...]
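Sebastian's redirect suggestion, written out as a well-formed nutch-site.xml property block (a sketch; whether to set it true or false depends on whether redirects should be followed despite db.ignore.external.links):

```xml
<!-- Sketch of the property Sebastian quotes; set to false to follow
     redirects even when external/internal links are ignored. -->
<property>
  <name>db.ignore.also.redirects</name>
  <value>false</value>
  <description>If true, the fetcher checks redirects the same way as
  links when ignoring internal or external links. Set to false to
  follow redirects despite the values for db.ignore.external.links
  and db.ignore.internal.links.</description>
</property>
```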
Re: Nutch not crawling all URLs
I don't know how I joined this mailing list but please take me off of this list; I have not used Nutch for a long time. Thanks! On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai wrote: > [...]
Nutch not crawling all URLs
Hi,

I am working with Apache Nutch 1.18 and Solr. I have set up the system successfully, but I'm now having the problem that Nutch is refusing to crawl all the URLs. I am now at a loss as to what I should do to correct this problem. It fetches about half of the URLs in the seed.txt file.

For instance, when I inject 20 URLs, only 9 are fetched. I have made a number of changes based on the suggestions I saw on the Nutch forum, as well as on Stack Overflow, but nothing seems to work.

This is what my nutch-site.xml file looks like:

<property>
  <name>http.agent.name</name>
  <value>Nutch Crawler</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>datalake.ng at gmail d</value>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for
  truncated documents. By default this property is activated due to
  extremely high levels of CPU which parsing can sometimes take.</description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page
  outlinks will be processed for a page; otherwise, all outlinks will be
  processed.</description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.</description>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

Other changes I have made include changing the following in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.</description>
</property>
<property>
  <name>ftp.timeout</name>
  <value>10</value>
</property>
<property>
  <name>ftp.server.timeout</name>
  <value>15</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>70</value>
</property>

I also commented out the line below in the regex-urlfilter file:

# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]

Nothing seems to work.

What is it that I'm not doing, or doing wrongly here?

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in Scotland, number SC015263.
Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin
The issue is now tracked in https://issues.apache.org/jira/browse/NUTCH-2907 On 10/28/21 15:31, Sebastian Nagel wrote: > Hi Shi Wei, > > sorry, but it looks like the Selenium protocol plugin has never been > used with a proxy over https. There are two points which need (at a > first glance) a rework: > > 1. the protocol tries to establish a TLS/SSL connection to the proxy if > the URL to be crawled is a https:// URL. There might be some proxies > which can do this, but the proxies I'm aware of expect a HTTP CONNECT > [1] for HTTPS proxying. > > 2. probably also the browser / driver needs to be configured to > use the same proxy. Afaics, this isn't done but is a requirement > if the proxy is required for accessing web content. However, it > might be possible by setting environment variables. > > Sorry again. Feel free to open a Jira issue to get this fixed. > > Best, > Sebastian > > [1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method > > > On 10/28/21 11:45, sw.l...@quandatics.com wrote: >> Hi there, >> >> >> >> Good day! 
>> >> >> >> We would like to crawl the web data by executing the Nutch with Selenium >> plugin with the following command: >> >> >> >> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http >> https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial >> >> >> >> However, it failed with the following error message: >> >> >> >> 2021-10-26 19:07:53,961 INFO selenium.Http - http.proxy.host = xxx.xx.xx.xx >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.port = >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.exception.list = >> true >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.timeout = 1 >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.content.limit = 1048576 >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.agent = Apache Nutch >> Test/Nutch-1.18 >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept.language = >> en-us,en-gb,en;q=0.7,*;q=0.3 >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept = >> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 >> >> 2021-10-26 19:07:53,962 INFO selenium.Http - http.enable.cookie.header = >> true >> >> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output >> >> javax.net.ssl.SSLHandshakeException: Remote host closed connection during >> handshake >> >> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994) >> >> at sun.security.ssl.SSL >> >> >> >> FYI, we have tried the following approaches but the issues persisted. >> >> >> >> 1. Set the http.tls.certificates.check to false >> >> 2. Import the website's certificates to our java truststores >> >> 3. Our Nutch is configured with proxy >> >> >> >> Kindly advise. Thanks in advance! >> >> >> >> >> >> Best Regards, >> >> Shi Wei >> >> >> >>
Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin
Hi Shi Wei, sorry, but it looks like the Selenium protocol plugin has never been used with a proxy over https. There are two points which need (at a first glance) a rework: 1. the protocol tries to establish a TLS/SSL connection to the proxy if the URL to be crawled is a https:// URL. There might be some proxies which can do this, but the proxies I'm aware of expect a HTTP CONNECT [1] for HTTPS proxying. 2. probably also the browser / driver needs to be configured to use the same proxy. Afaics, this isn't done but is a requirement if the proxy is required for accessing web content. However, it might be possible by setting environment variables. Sorry again. Feel free to open a Jira issue to get this fixed. Best, Sebastian [1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method On 10/28/21 11:45, sw.l...@quandatics.com wrote: > Hi there, > > > > Good day! > > > > We would like to crawl the web data by executing the Nutch with Selenium > plugin with the following command: > > > > $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http > https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial > > > > However, it failed with the following error message: > > > > 2021-10-26 19:07:53,961 INFO selenium.Http - http.proxy.host = xxx.xx.xx.xx > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.port = > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.exception.list = > true > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.timeout = 1 > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.content.limit = 1048576 > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.agent = Apache Nutch > Test/Nutch-1.18 > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept.language = > en-us,en-gb,en;q=0.7,*;q=0.3 > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept = > text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 > > 2021-10-26 19:07:53,962 INFO selenium.Http - http.enable.cookie.header = > true > > 
2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output > > javax.net.ssl.SSLHandshakeException: Remote host closed connection during > handshake > > at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994) > > at sun.security.ssl.SSL > > > > FYI, we have tried the following approaches but the issues persisted. > > > > 1. Set the http.tls.certificates.check to false > > 2. Import the website's certificates to our java truststores > > 3. Our Nutch is configured with proxy > > > > Kindly advise. Thanks in advance! > > > > > > Best Regards, > > Shi Wei > > > >
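For reference, HTTP CONNECT tunneling (point 1 in Sebastian's reply) looks roughly like this on the wire: the client sends a plain-text request to the proxy, the proxy answers 200, and from then on relays the TLS bytes unmodified. Host and port below are illustrative:

```
CONNECT cwiki.apache.org:443 HTTP/1.1
Host: cwiki.apache.org:443

HTTP/1.1 200 Connection established

<TLS handshake with the origin server now proceeds through the tunnel>
```

Opening a TLS connection to the proxy itself, as the plugin apparently does, only works with the relatively rare proxies that terminate TLS on their own listening port.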
javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin
Hi there, Good day! We would like to crawl the web data by executing the Nutch with Selenium plugin with the following command: $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial However, it failed with the following error message: 2021-10-26 19:07:53,961 INFO selenium.Http - http.proxy.host = xxx.xx.xx.xx 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.port = 2021-10-26 19:07:53,962 INFO selenium.Http - http.proxy.exception.list = true 2021-10-26 19:07:53,962 INFO selenium.Http - http.timeout = 1 2021-10-26 19:07:53,962 INFO selenium.Http - http.content.limit = 1048576 2021-10-26 19:07:53,962 INFO selenium.Http - http.agent = Apache Nutch Test/Nutch-1.18 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept.language = en-us,en-gb,en;q=0.7,*;q=0.3 2021-10-26 19:07:53,962 INFO selenium.Http - http.accept = text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8 2021-10-26 19:07:53,962 INFO selenium.Http - http.enable.cookie.header = true 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output javax.net.ssl.SSLHandshakeException: Remote host closed connection during handshake at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994) at sun.security.ssl.SSL FYI, we have tried the following approaches but the issues persisted. 1. Set the http.tls.certificates.check to false 2. Import the website's certificates to our java truststores 3. Our Nutch is configured with proxy Kindly advise. Thanks in advance! Best Regards, Shi Wei
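The proxy settings echoed in the log output above come from nutch-site.xml properties. A sketch with illustrative values (only the property names are taken from the log; host and port here are placeholders):

```xml
<!-- Sketch: proxy settings corresponding to the http.proxy.* log lines
     above; host and port values are illustrative. -->
<property>
  <name>http.proxy.host</name>
  <value>proxy.example.com</value>
</property>
<property>
  <name>http.proxy.port</name>
  <value>3128</value>
</property>
```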
Re: Cant integrate the kerberos enabled solr cloud with nutch
Hi Sebastian, Thanks for your reply. We understand that the current indexer-solr plugin only supports basic authentication. We will open the Jira issue accordingly. Roughly how long would it take for the issue to be resolved? Besides, is there any other workaround that we could try for the described issue? In our research over the last few days, it seems that somebody was able to connect to Solr with Kerberos authentication via the NTLM scheme of HttpAuthenticationSchemes. Could you help check whether this would work? https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=115512111#content/view/115512111 Your Sincerely, Shi Wei > On 22 Oct 2021, at 6:55 PM, Sebastian Nagel > wrote: > > Hi Shi Wei, > > > kerberos > > sorry, I missed this detail. The plugin indexer-solr for now > only supports basic authentication. > > Could you open a Jira issue to get Kerberos authentication > implemented on the Nutch site? > https://issues.apache.org/jira/projects/NUTCH > > See also: > > https://solr.apache.org/guide/8_5/kerberos-authentication-plugin.html#using-solrj-with-a-kerberized-solr > > Thanks, > Sebastian > > On 10/22/21 12:01 PM, sw.l...@quandatics.com wrote: >> Hi Sebastian, >> Here is the index-writers.xml you requested. Thanks. >> Your Sincerely, >> Shi Wei >> -Original Message- >> From: Sebastian Nagel >> Sent: Friday, 22 October, 2021 5:46 PM >> To: user@nutch.apache.org >> Subject: Re: Cant integrate the kerberos enabled solr cloud with nutch >> Hi Shi Wei, >> could you also share the index writer configuration (conf/index-writers.xml)? >> The default is unauthenticated access to Solr, see the snippet below. >> The file httpclient-auth.xml is not relevant for the Solr indexer; it's used >> if a crawled web site requires authentication in order to fetch the content >> via the plugin protocol-httpclient. 
>> Best,
>> Sebastian
>>
>> <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
>>   <parameters>
>>     <param name="type" value="http"/>
>>     <param name="url" value="http://localhost:8983/solr/nutch"/>
>>     ...
>>   </parameters>
>> </writer>
>>
>> On 10/22/21 10:10 AM, sw.l...@quandatics.com wrote:
>>> Hi,
>>>
>>> We have encountered a problem: we can't integrate the Kerberos-enabled
>>> Solr cloud with Nutch.
>>>
>>> When executing the command "nutch index crawl/crawldb/ -linkdb crawl/linkdb/
>>> $s1 -filter -normalize", it fails with "HTTP ERROR 401 Problem accessing
>>> /solr/admin/collections. Reason: Authentication required", but we are able
>>> to curl it with the keytab.
>>>
>>> Version of Nutch: 1.18
>>>
>>> Yours sincerely,
>>>
>>> Shi Wei
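For reference, the client side of the Solr Ref Guide approach Sebastian linked boils down to a JAAS configuration plus a JVM system property. This is a hedged, hypothetical sketch only: indexer-solr does not support Kerberos today, so the Nutch-side wiring would first need the feature from the proposed Jira issue; the keytab path, principal, and file locations are placeholder assumptions.

```shell
# Hypothetical sketch of the JAAS configuration a kerberized SolrJ client
# reads (per the Solr Ref Guide). All paths and the principal are placeholders.
cat > /tmp/jaas-client.conf <<'EOF'
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="/etc/security/keytabs/nutch.keytab"
  storeKey=true
  useTicketCache=false
  principal="nutch@EXAMPLE.COM";
};
EOF

# The JVM running the indexing job would then be pointed at the JAAS file,
# e.g. (again hypothetical, pending Kerberos support in indexer-solr):
#   export NUTCH_OPTS="$NUTCH_OPTS -Djava.security.auth.login.config=/tmp/jaas-client.conf"
echo "wrote $(wc -l < /tmp/jaas-client.conf) lines of JAAS config"
```

The `Client` section name is the SolrJ default; a different application name would need to be passed explicitly.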
RE: Can't integrate the Kerberos-enabled Solr cloud with Nutch
Hi Sebastian,

Here is the index-writers.xml you requested. Thanks.

Yours sincerely,
Shi Wei

-----Original Message-----
From: Sebastian Nagel
Sent: Friday, 22 October, 2021 5:46 PM
To: user@nutch.apache.org
Subject: Re: Can't integrate the Kerberos-enabled Solr cloud with Nutch

Hi Shi Wei,

could you also share the index writer configuration (conf/index-writers.xml)? The default is unauthenticated access to Solr, see the snippet below.

The file httpclient-auth.xml is not relevant for the Solr indexer; it's used if a crawled web site requires authentication in order to fetch the content via the plugin protocol-httpclient.

Best,
Sebastian

<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    ...
  </parameters>
</writer>

On 10/22/21 10:10 AM, sw.l...@quandatics.com wrote:
> Hi,
>
> We have encountered a problem: we can't integrate the Kerberos-enabled Solr
> cloud with Nutch.
>
> When executing the command "nutch index crawl/crawldb/ -linkdb crawl/linkdb/
> $s1 -filter -normalize", it fails with "HTTP ERROR 401 Problem accessing
> /solr/admin/collections. Reason: Authentication required", but we are able
> to curl it with the keytab.
>
> Version of Nutch: 1.18
>
> Yours sincerely,
>
> Shi Wei

index-writers.xml (excerpt):

<writers xmlns="http://lucene.apache.org/nutch"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">
  ...
  <param name="url" value="http://utility.quandatics.tech:8983/solr/"/>
  ...
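Since indexer-solr only supports basic authentication at this point, the nearest working setup is to front Solr with basic auth and switch it on in the writer parameters. A hedged sketch follows: the URL and credentials are placeholders, and the parameter names follow the stock index-writers.xml shipped with Nutch 1.18.

```shell
# Sketch: the authentication-related parameters of the Solr index writer.
# Values are placeholders; merge these into conf/index-writers.xml rather
# than replacing the whole file.
cat > /tmp/index-writers-snippet.xml <<'EOF'
<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <param name="auth" value="true"/>
    <param name="username" value="solr_user"/>
    <param name="password" value="solr_pass"/>
  </parameters>
  <!-- keep the <mapping> section from the default file -->
</writer>
EOF
echo "snippet written"
```

The matching Solr side would be its BasicAuthPlugin; Kerberos remains out of reach until the plugin gains support for it.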
Re: Can't integrate the Kerberos-enabled Solr cloud with Nutch
Hi Shi Wei,

could you also share the index writer configuration (conf/index-writers.xml)? The default is unauthenticated access to Solr, see the snippet below.

The file httpclient-auth.xml is not relevant for the Solr indexer; it's used if a crawled web site requires authentication in order to fetch the content via the plugin protocol-httpclient.

Best,
Sebastian

<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    ...
  </parameters>
</writer>

On 10/22/21 10:10 AM, sw.l...@quandatics.com wrote:
> Hi,
>
> We have encountered a problem: we can't integrate the Kerberos-enabled Solr
> cloud with Nutch.
>
> When executing the command "nutch index crawl/crawldb/ -linkdb crawl/linkdb/
> $s1 -filter -normalize", it fails with "HTTP ERROR 401 Problem accessing
> /solr/admin/collections. Reason: Authentication required", but we are able
> to curl it with the keytab.
>
> Version of Nutch: 1.18
>
> Yours sincerely,
>
> Shi Wei
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark,

This is a lot of information... thank you for compiling it all. Ideally, the version of Hadoop being used with Nutch should ALWAYS match the Hadoop binaries referenced in https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run into the classpath issues. I would like to encourage you to create a wiki page so we can document this in a user-friendly way... would you be open to that? You can create an account at https://cwiki.apache.org/confluence/display/NUTCH/Home

Thanks for your consideration.
lewismc

On 2021/07/14 18:27:23, Clark Benham wrote:
> Hi All,
>
> Sebastian helped fix my issue: using S3 as a backend, I was able to get
> nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an
> oddity: nutch-1.19 shipped 11 Hadoop 3.1.3 jars, e.g.
> hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ... (this made running
> `hadoop version` report 3.1.3), so I replaced those 3.1.3 jars with the
> 3.3.0 jars from the Hadoop download.
> Also, in the main Nutch branch
> (https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently
> has dependencies on hadoop-3.1.3, e.g.:
>
>   <dependency ... rev="3.1.3" conf="*->default" />
>   <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient"
>       rev="3.1.3" conf="*->default" />
>
> I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
>
> I didn't change "mapreduce.job.dir" because there are no namenode or
> datanode processes running when using Hadoop with S3, so the UI is blank.
>
> Copied from email with Sebastian:
>
> >> The plugin loader doesn't appear to be able to read from s3 in nutch-1.18
> >> with hadoop-3.2.1 [1].
> >
> > I had a look into the plugin loader: it can only read from the local file
> > system. But that's OK, because the Nutch job file is copied to the local
> > machine and unpacked.
> > Here is how the paths look on one of the running Common Crawl
> > task nodes:
>
> The configs for the working Hadoop are as follows:
>
> core-site.xml:
>
> <configuration>
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/home/hdoop/tmpdata</value>
>   </property>
>   <property>
>     <name>fs.defaultFS</name>
>     <value>s3a://my-bucket</value>
>   </property>
>   <property>
>     <name>fs.s3a.access.key</name>
>     <value>KEY_PLACEHOLDER</value>
>     <description>AWS access key ID.
>       Omit for IAM role-based or provider-based authentication.</description>
>   </property>
>   <property>
>     <name>fs.s3a.secret.key</name>
>     <value>SECRET_PLACEHOLDER</value>
>     <description>AWS secret key.
>       Omit for IAM role-based or provider-based authentication.</description>
>   </property>
>   <property>
>     <name>fs.s3a.aws.credentials.provider</name>
>     <value></value>
>     <description>
>       Comma-separated class names of credential provider classes which implement
>       com.amazonaws.auth.AWSCredentialsProvider.
>       These are loaded and queried in sequence for a valid set of credentials.
>       Each listed class must implement one of the following means of
>       construction, which are attempted in order:
>       1. a public constructor accepting java.net.URI and
>          org.apache.hadoop.conf.Configuration,
>       2.
>          com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
>          configuration of AWS access key ID and secret access key in
>          environment variables named AWS_ACCESS_KEY_ID and
>          AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
>       3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
>          of instance profile credentials if running in an EC2 VM.
>     </description>
>   </property>
> </configuration>
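The version alignment lewismc recommends above can be checked mechanically. Here is a small sketch under stated assumptions: the jar naming convention `hadoop-<module>-<version>.jar` and the directory layout are assumptions, and the fake lib directory at the end only stands in for a real Nutch install's lib directory and `hadoop version` output.

```shell
# Sketch: list the Hadoop jars bundled with Nutch and flag any whose version
# differs from the version of the running Hadoop installation.
check_hadoop_jars() {
  local nutch_lib="$1" hadoop_ver="$2"
  local mismatch=0
  for jar in "$nutch_lib"/hadoop-*-[0-9]*.jar; do
    [ -e "$jar" ] || continue
    # Extract the version from names like hadoop-hdfs-3.1.3.jar
    ver="${jar##*-}"; ver="${ver%.jar}"
    if [ "$ver" != "$hadoop_ver" ]; then
      echo "MISMATCH: $(basename "$jar") (expected $hadoop_ver)"
      mismatch=1
    fi
  done
  return $mismatch
}

# Illustration with a fake lib dir (real usage would point at Nutch's lib/
# and pass the output of 'hadoop version'):
mkdir -p /tmp/nutch-lib
touch /tmp/nutch-lib/hadoop-hdfs-3.1.3.jar /tmp/nutch-lib/hadoop-common-3.3.0.jar
check_hadoop_jars /tmp/nutch-lib 3.3.0 || echo "found version mismatches"
```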
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark,

thanks for summarizing this discussion and sharing the final configuration! Good to know that it's possible to run Nutch on Hadoop using S3A without using HDFS (no namenode/datanodes running).

Best,
Sebastian
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi All,

Sebastian helped fix my issue: using S3 as a backend, I was able to get nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an oddity: nutch-1.19 shipped 11 Hadoop 3.1.3 jars, e.g. hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ... (this made running `hadoop version` report 3.1.3), so I replaced those 3.1.3 jars with the 3.3.0 jars from the Hadoop download.

Also, in the main Nutch branch (https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently has dependencies on hadoop-3.1.3.

I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.

I didn't change "mapreduce.job.dir" because there are no namenode or datanode processes running when using Hadoop with S3, so the UI is blank.

Copied from email with Sebastian:

>> The plugin loader doesn't appear to be able to read from s3 in nutch-1.18
>> with hadoop-3.2.1 [1].
>
> I had a look into the plugin loader: it can only read from the local file
> system. But that's OK, because the Nutch job file is copied to the local
> machine and unpacked. Here is how the paths look on one of the running
> Common Crawl task nodes:

The configs for the working Hadoop are as follows:

core-site.xml:

<configuration>
  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdoop/tmpdata</value>
  </property>
  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>
  <property>
    <name>fs.s3a.access.key</name>
    <value>KEY_PLACEHOLDER</value>
    <description>AWS access key ID.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>
  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_PLACEHOLDER</value>
    <description>AWS secret key.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>
  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <value></value>
    <description>
      Comma-separated class names of credential provider classes which implement
      com.amazonaws.auth.AWSCredentialsProvider.
      These are loaded and queried in sequence for a valid set of credentials.
      Each listed class must implement one of the following means of
      construction, which are attempted in order:
      1. a public constructor accepting java.net.URI and
         org.apache.hadoop.conf.Configuration,
      2.
         a public static method named getInstance that accepts no
         arguments and returns an instance of
         com.amazonaws.auth.AWSCredentialsProvider, or
      3. a public default constructor.
      Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
      anonymous access to a publicly accessible S3 bucket without any credentials.
      Please note that allowing anonymous access to an S3 bucket compromises
      security and therefore is unsuitable for most use cases. It can be useful
      for accessing public data sets without requiring AWS credentials.
      If unspecified, then the default list of credential provider classes,
      queried in sequence, is:
      1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
         Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
      2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
         configuration of AWS access key ID and secret access key in
         environment variables named AWS_ACCESS_KEY_ID and
         AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
      3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
         of instance profile credentials if running in an EC2 VM.
    </description>
  </property>
</configuration>

Dependencies:
  org.apache.hadoop : hadoop-client : ${hadoop.version}
  org.apache.hadoop : hadoop-aws : ${hadoop.version}

hadoop-env.sh:

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Set Hadoop-specific environment variables here.

##
## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE
## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.
##
## Precedence rules:
##
## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults
##
## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults
##

# Many of the options here are built from the perspective that users
# may want to provide OVERWRITING values on the command line.
# For example:
#
#  JAVA_HOME=/usr/java/testing hdfs dfs -ls
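Besides editing jars by hand, recent Hadoop 3.x releases offer a supported hook in hadoop-env.sh for pulling optional tool modules such as the S3A connector onto the classpath of every Hadoop command. A hedged sketch follows; the throwaway directory stands in for a real $HADOOP_HOME, and whether this suffices for a given deployment depends on the Hadoop build in use.

```shell
# Sketch: enable the hadoop-aws tool module via hadoop-env.sh instead of
# copying jars into share/hadoop/common/lib. HADOOP_HOME is a placeholder.
HADOOP_HOME=/tmp/hadoop-demo
mkdir -p "$HADOOP_HOME/etc/hadoop"
cat >> "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" <<'EOF'
# Put share/hadoop/tools/lib/hadoop-aws-*.jar (and the bundled AWS SDK)
# on the classpath of every Hadoop command:
export HADOOP_OPTIONAL_TOOLS="hadoop-aws"
EOF
grep -q 'HADOOP_OPTIONAL_TOOLS' "$HADOOP_HOME/etc/hadoop/hadoop-env.sh" && echo "hadoop-env.sh updated"
```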
Looking for testers - Nutch Dockerfile
Hi user@,

Are you interested in the Nutch Dockerfile? If so, keep reading.

We are looking for some assistance to test proposed additions to the Nutch Dockerfile. Essentially, the changes would facilitate installing and running the Nutch REST server and/or the Nutch WebApp in addition to the Nutch server-side installation. How to build and run is all documented in the accompanying README.

If you are interested, please see https://github.com/apache/nutch/pull/691 and comment in the thread.

Thanks,
lewismc
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Sebastian,

NUTCH_HOME=~/nutch; the local filesystem. I am using a plain, pre-built Hadoop.

There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1/3.3.0 or Nutch 1.18/1.19, but mapreduce.job.hdfs-servers defaults to ${fs.defaultFS}, so s3a://temp-crawler in our case. The plugin loader doesn't appear to be able to read from S3 in nutch-1.18 with hadoop-3.2.1 [1].

Using java & javac 11 with hadoop-3.3.0 downloaded and untarred, and a nutch-1.19 I built: I can run a mapreduce job on S3 and a Nutch job on HDFS, but running Nutch on S3 still gives "URLNormalizer not found", with the plugin dir on the local filesystem or on S3a. How would you recommend I go about getting the plugin loader to read from other file systems?

[1] I still get 'x point org.apache.nutch.net.URLNormalizer not found' (same stack trace as the previous email) with `plugin.folders s3a://temp-crawler/user/hdoop/nutch-plugins` set in my nutch-site.xml, while `hadoop fs -ls s3a://temp-crawler/user/hdoop/nutch-plugins` lists all the plugins as there.

For posterity: I got hadoop-3.3.0 working with an S3 backend by:

  cd ~/hadoop-3.3.0
  cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib
  cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar ./share/hadoop/common/lib

to solve "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not found", despite the class existing in ~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar, checking it's on the classpath with `hadoop classpath | tr ":" "\n" | grep share/hadoop/tools/lib/hadoop-aws-3.3.0.jar`, as well as adding it to hadoop-env.sh. See
https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f

On Tue, Jun 15, 2021 at 2:01 AM Sebastian Nagel wrote:
> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
>
> Also important: the value of "mapreduce.job.dir" - it's usually
> on hdfs:// and I'm not sure whether the plugin loader is able to
> read from other filesystems. At least, I haven't tried.
>
> On 6/15/21 10:53 AM, Sebastian Nagel wrote:
> > Hi Clark,
> >
> > sorry, I should have read your mail until the end - you mentioned that
> > you downgraded Nutch to run with JDK 8.
> >
> > Could you share which filesystem NUTCH_HOME points to?
> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >
> > Best,
> > Sebastian
> >
> > On 6/15/21 10:24 AM, Clark Benham wrote:
> >> Hi,
> >>
> >> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
> >> backend/filesystem; however, I get an error 'URLNormalizer class not found'.
> >> I have edited nutch-site.xml so this plugin should be included:
> >>
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
> >> </property>
> >>
> >> and then built on both nodes (I only have 2 machines). I've successfully
> >> run Nutch locally and in distributed mode using HDFS, and I've run a
> >> mapreduce job with S3 as Hadoop's file system.
> >>
> >> I thought it was possible Nutch is not reading nutch-site.xml, because I
> >> can resolve an error by setting the config through the CLI despite this
> >> duplicating nutch-site.xml.
> >>
> >> The command:
> >>
> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments`
> >>
> >> throws
> >>
> >> `java.lang.IllegalArgumentException: Fetcher: No agents listed in
> >> 'http.agent.name' property`
> >>
> >> while if I pass a value for http.agent.name with
> >> `-Dhttp.agent.name=myScrapper`
> >> (making the command `hadoop jar
> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.fetcher.Fetcher
> >> -Dhttp.agent.name=clark crawl/crawldb crawl/segments`), I get an error
> >> about there being no input path, which makes sense as I haven't been able
> >> to generate any segments.
> >>
> >> However, this method of setting Nutch configs doesn't work for injecting
> >> URLs; e.g.:
> >>
> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.crawl.Injector
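The classpath check used above (`hadoop classpath | tr ":" "\n" | grep ...`) can be wrapped as a small reusable helper. A sketch; the sample classpath string is a stand-in so the example runs without a Hadoop install, and real usage would pass the output of `hadoop classpath`.

```shell
# Sketch: report whether a given jar name appears in a colon-separated
# classpath string, one entry per line as in the thread's one-liner.
classpath_has() {
  # $1 = classpath string, $2 = jar name fragment to look for
  printf '%s' "$1" | tr ':' '\n' | grep -q "$2"
}

# Self-contained illustration with a fabricated classpath:
sample_cp="/opt/hadoop/share/hadoop/common/lib/x.jar:/opt/hadoop/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar"
if classpath_has "$sample_cp" "hadoop-aws-3.3.0.jar"; then
  echo "hadoop-aws on classpath"
fi
# Real usage: classpath_has "$(hadoop classpath)" hadoop-aws-3.3.0.jar
```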
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
> The local file system? Or hdfs:// or even s3:// resp. s3a://?

Also important: the value of "mapreduce.job.dir" - it's usually on hdfs:// and I'm not sure whether the plugin loader is able to read from other filesystems. At least, I haven't tried.

On 6/15/21 10:53 AM, Sebastian Nagel wrote:
> Hi Clark,
>
> sorry, I should have read your mail until the end - you mentioned that
> you downgraded Nutch to run with JDK 8.
>
> Could you share which filesystem NUTCH_HOME points to?
> The local file system? Or hdfs:// or even s3:// resp. s3a://?
>
> Best,
> Sebastian
>
> On 6/15/21 10:24 AM, Clark Benham wrote:
>> Hi,
>>
>> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
>> backend/filesystem; however, I get an error 'URLNormalizer class not found'.
>> I have edited nutch-site.xml so this plugin should be included:
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
>> </property>
>>
>> and then built on both nodes (I only have 2 machines). I've successfully
>> run Nutch locally and in distributed mode using HDFS, and I've run a
>> mapreduce job with S3 as Hadoop's file system.
>>
>> I thought it was possible Nutch is not reading nutch-site.xml, because I
>> can resolve an error by setting the config through the CLI despite this
>> duplicating nutch-site.xml.
>>
>> The command:
>>
>> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments`
>>
>> throws
>>
>> `java.lang.IllegalArgumentException: Fetcher: No agents listed in
>> 'http.agent.name' property`
>>
>> while if I pass a value for http.agent.name with `-Dhttp.agent.name=myScrapper`
>> (making the command `hadoop jar
>> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> org.apache.nutch.fetcher.Fetcher -Dhttp.agent.name=clark crawl/crawldb
>> crawl/segments`), I get an error about there being no input path, which
>> makes sense as I haven't been able to generate any segments.
>> However, this method of setting Nutch configs doesn't work for injecting
>> URLs; e.g.:
>>
>> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> org.apache.nutch.crawl.Injector -Dplugin.includes=".*" crawl/crawldb urls`
>>
>> fails with the same "URLNormalizer" not found. I tried copying the plugin
>> dir to S3 and setting plugin.folders to be a path on S3, without success.
>> (I expect the plugins to be bundled with the .job, so this step should be
>> unnecessary.)
>>
>> The full stack trace for `hadoop jar
>> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
>> org.apache.nutch.crawl.Injector crawl/crawldb urls`:
>>
>> SLF4J: Class path contains multiple SLF4J bindings.
>> SLF4J: Found binding in [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: Found binding in [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
>> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
>> # Took out multiple INFO messages
>> 2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id : attempt_1623740678244_0001_m_01_0, Status : FAILED
>> Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
>>     at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
>>     at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>>     at java.security.AccessController.doPrivileged(Native Method)
>>     at javax.security.auth.Subject.doAs(Subject.java:422)
>>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
>> # This error repeats 6 times total, 3 times for each node
>> 2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
>> 2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001 failed with state FAILED due to: Task failed task_1623740678244_0001_m_01
>> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces:0
>> 2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
>>     Job Counters
>>         Failed map tasks=7
>>         Killed map tasks=1
>>         Killed reduce tasks=1
>>         Launched map tasks=8
>>         Other local map tasks=6
>>         Rack-local map tasks=2
>>         Total time spent by all maps in occupied slots (ms)=63196
>>         Total time spent by all reduces in occupied slots (ms)=0
>>         Total time spent by all map tasks (ms)=31598
>>         Total vcore-milliseconds taken by all map tasks=31598
>>         Total megabyte-milliseconds taken by all map tasks=8089088
>>     Map-Reduce Framework
>>         CPU time spent (ms)=0
>>         Physical memory (bytes) snapshot=0
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark,

sorry, I should have read your mail until the end - you mentioned that you downgraded Nutch to run with JDK 8.

Could you share which filesystem NUTCH_HOME points to? The local file system? Or hdfs:// or even s3:// resp. s3a://?

Best,
Sebastian

On 6/15/21 10:24 AM, Clark Benham wrote:
> Hi,
>
> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
> backend/filesystem; however, I get an error 'URLNormalizer class not found'.
> I have edited nutch-site.xml so this plugin should be included:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
> </property>
>
> and then built on both nodes (I only have 2 machines). I've successfully
> run Nutch locally and in distributed mode using HDFS, and I've run a
> mapreduce job with S3 as Hadoop's file system.
>
> I thought it was possible Nutch is not reading nutch-site.xml, because I
> can resolve an error by setting the config through the CLI despite this
> duplicating nutch-site.xml.
>
> The command:
>
> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments`
>
> throws
>
> `java.lang.IllegalArgumentException: Fetcher: No agents listed in
> 'http.agent.name' property`
>
> while if I pass a value for http.agent.name with `-Dhttp.agent.name=myScrapper`
> (making the command `hadoop jar
> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.fetcher.Fetcher -Dhttp.agent.name=clark crawl/crawldb
> crawl/segments`), I get an error about there being no input path, which
> makes sense as I haven't been able to generate any segments.
>
> However, this method of setting Nutch configs doesn't work for injecting
> URLs; e.g.:
>
> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.crawl.Injector -Dplugin.includes=".*" crawl/crawldb urls`
>
> fails with the same "URLNormalizer" not found.
> I tried copying the plugin dir to S3 and setting plugin.folders to be a
> path on S3, without success. (I expect the plugins to be bundled with the
> .job, so this step should be unnecessary.)
>
> The full stack trace for `hadoop jar
> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.crawl.Injector crawl/crawldb urls`:
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> # Took out multiple INFO messages
> 2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id : attempt_1623740678244_0001_m_01_0, Status : FAILED
> Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
>     at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
>     at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> # This error repeats 6 times total, 3 times for each node
> 2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
> 2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001 failed with state FAILED due to: Task failed task_1623740678244_0001_m_01
> Job failed as tasks failed.
> failedMaps:1 failedReduces:0 killedMaps:0 killedReduces:0
> 2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
>     Job Counters
>         Failed map tasks=7
>         Killed map tasks=1
>         Killed reduce tasks=1
>         Launched map tasks=8
>         Other local map tasks=6
>         Rack-local map tasks=2
>         Total time spent by all maps in occupied slots (ms)=63196
>         Total time spent by all reduces in occupied slots (ms)=0
>         Total time spent by all map tasks (ms)=31598
>         Total vcore-milliseconds taken by all map tasks=31598
>         Total megabyte-milliseconds taken by all map tasks=8089088
>     Map-Reduce Framework
>         CPU time spent (ms)=0
>         Physical memory (bytes) snapshot=0
>         Virtual memory (bytes) snapshot=0
> 2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_01
> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces:0
> 2021-06-15 07:06:29,562 ERROR craw
Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi Clark,

the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. It looks like something is fundamentally wrong, not only with the plugins.

> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3

Are you aware that Nutch 1.19 will require JDK 11? - the recent Nutch snapshots already do, see NUTCH-2857. Hadoop 3.2.1 does not support JDK 11; you'd need to use 3.3.0.

Is a plain vanilla Hadoop used, or a specific Hadoop distribution (e.g. Cloudera, Amazon EMR)?

Note: the normal way to run Nutch is:

  $NUTCH_HOME/runtime/deploy/bin/nutch ...

But in the end it will also call "hadoop jar apache-nutch-xyz.job ..."

Best,
Sebastian

On 6/15/21 10:24 AM, Clark Benham wrote:
> Hi,
>
> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
> backend/filesystem; however, I get an error 'URLNormalizer class not found'.
> I have edited nutch-site.xml so this plugin should be included:
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
> </property>
>
> and then built on both nodes (I only have 2 machines). I've successfully
> run Nutch locally and in distributed mode using HDFS, and I've run a
> mapreduce job with S3 as Hadoop's file system.
>
> I thought it was possible Nutch is not reading nutch-site.xml, because I
> can resolve an error by setting the config through the CLI despite this
> duplicating nutch-site.xml.
> The command:
>
> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments`
>
> throws
>
> `java.lang.IllegalArgumentException: Fetcher: No agents listed in
> 'http.agent.name' property`
>
> while if I pass a value for http.agent.name with `-Dhttp.agent.name=myScrapper`
> (making the command `hadoop jar
> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.fetcher.Fetcher -Dhttp.agent.name=clark crawl/crawldb
> crawl/segments`), I get an error about there being no input path, which
> makes sense as I haven't been able to generate any segments.
>
> However, this method of setting Nutch configs doesn't work for injecting
> URLs; e.g.:
>
> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.crawl.Injector -Dplugin.includes=".*" crawl/crawldb urls`
>
> fails with the same "URLNormalizer" not found. I tried copying the plugin
> dir to S3 and setting plugin.folders to be a path on S3, without success.
> (I expect the plugins to be bundled with the .job, so this step should be
> unnecessary.)
>
> The full stack trace for `hadoop jar
> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> org.apache.nutch.crawl.Injector crawl/crawldb urls`:
>
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> # Took out multiple INFO messages
> 2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id : attempt_1623740678244_0001_m_01_0, Status : FAILED
> Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
>     at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)
>     at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
> # This error repeats 6 times total, 3 times for each node
> 2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%
> 2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001 failed with state FAILED due to: Task failed task_1623740678244_0001_m_01
> Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces:0
> 2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
>     Job Counters
>         Failed map tasks=7
>         Killed map tasks=1
>         Killed reduce tasks=1
>         Launched map tasks=8
>         Other local map tasks=6
>         Rack-local map tasks=2
>         Total time spent by all maps in occupied slots (ms)=63196
>         Total time spent by all reduces in occupied slots (ms)=0
>         Total time spent by all map tasks (ms)=31598
>         Total vcore-milliseconds taken
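The JDK/Hadoop constraints Sebastian states (Nutch 1.19 needs JDK 11, and JDK 11 needs Hadoop 3.3.0+) are easy to violate silently, so a preflight check on each node helps. A sketch; the `java -version` parsing shown in the trailing comment is an assumption about that command's output format.

```shell
# Sketch: extract the major version from version strings such as
# "1.8.0_292" (pre-9 scheme) or "11.0.11" (post-9 scheme).
java_major() {
  case "$1" in
    1.*) echo "$1" | cut -d. -f2 ;;   # old scheme: 1.8.x -> 8
    *)   echo "$1" | cut -d. -f1 ;;   # new scheme: 11.x  -> 11
  esac
}

java_major "1.8.0_292"   # -> 8
java_major "11.0.11"     # -> 11

# Real usage (output format is an assumption; verify against your JVM):
#   ver=$(java -version 2>&1 | awk -F'"' '/version/ {print $2}')
#   [ "$(java_major "$ver")" -ge 11 ] || echo "Nutch 1.19 requires JDK 11+"
```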
Running Nutch on Hadoop with S3 filesystem; 'URLNormalizer not found'
Hi,

I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3 backend/filesystem; however, I get an error 'URLNormalizer class not found'. I have edited nutch-site.xml so this plugin should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

and then built on both nodes (I only have 2 machines). I've successfully run Nutch locally and in distributed mode using HDFS, and I've run a mapreduce job with S3 as Hadoop's file system.

I thought it was possible Nutch is not reading nutch-site.xml, because I can resolve an error by setting the config through the CLI despite this duplicating nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property`

while if I pass a value for http.agent.name with `-Dhttp.agent.name=myScrapper` (making the command `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.fetcher.Fetcher -Dhttp.agent.name=clark crawl/crawldb crawl/segments`), I get an error about there being no input path, which makes sense as I haven't been able to generate any segments.

However, this method of setting Nutch configs doesn't work for injecting URLs; e.g.:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector -Dplugin.includes=".*" crawl/crawldb urls`

fails with the same "URLNormalizer" not found. I tried copying the plugin dir to S3 and setting plugin.folders to be a path on S3, without success.
(I expect the plugins to be bundled with the .job file, so this step should be unnecessary.)

The full stack trace for `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
#Took out multiple INFO messages
2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id : attempt_1623740678244_0001_m_01_0, Status : FAILED
Error: java.lang.RuntimeException: x point org.apache.nutch.net.URLNormalizer not found.
    at org.apache.nutch.net.URLNormalizers.(URLNormalizers.java:145)
    at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:422)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)
#This error repeats 6 times total, 3 times for each node
2021-06-15 07:06:26,035 INFO mapreduce.Job: map 100% reduce 100%
2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001 failed with state FAILED due to: Task failed task_1623740678244_0001_m_01
Job failed as tasks failed.
failedMaps:1 failedReduces:0 killedMaps:0 killedReduces:0
2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14
    Job Counters
        Failed map tasks=7
        Killed map tasks=1
        Killed reduce tasks=1
        Launched map tasks=8
        Other local map tasks=6
        Rack-local map tasks=2
        Total time spent by all maps in occupied slots (ms)=63196
        Total time spent by all reduces in occupied slots (ms)=0
        Total time spent by all map tasks (ms)=31598
        Total vcore-milliseconds taken by all map tasks=31598
        Total megabyte-milliseconds taken by all map tasks=8089088
    Map-Reduce Framework
        CPU time spent (ms)=0
        Physical memory (bytes) snapshot=0
        Virtual memory (bytes) snapshot=0
2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_01
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces:0
2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector: java.lang.RuntimeException: Injector job did not succeed, job status: FAILED, reason: Task failed task_1623740678244_0001_m_01
Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0 killedReduces:0
    at org.apache.nutch.crawl.Injector.inject(Injecto
Re: About Nutch 1.x Rest API at port 8081
Hi All, I guess the solution was using 0.0.0.0 instead of localhost. Command: '/root/nutch/bin/nutch startserver -port 8081 -host 0.0.0.0' https://stackoverflow.com/a/67972504/4907821 Thanks for your help, Gokmen On 2021-06-13 12:37, gokmen.yontem wrote: Hello all, My last question turned out to be actually a Docker issue, which I understood with great embarrassment. This time, I'm pretty sure it's a similar issue :) But believe me, I put a lot of effort into this one as well. This documentation (https://cwiki.apache.org/confluence/display/NUTCH/Nutch+1.X+RESTAPI) tells me that after I run the server with `bin/nutch startserver` the API starts working on port 8081. Within the Docker container (which I enter with docker exec -it) I can confirm that the API is up and running and responsive to my REST calls, but I cannot reach it from outside of the container. So far I have tried visiting http://localhost:8081/admin in my browser (I have also tried other variations like 127.0.0.1, 0.0.0.0, the Docker IP, the machine IP, etc.), and Postman with all of these URLs as well. Additionally, I added a frontend application to my Docker network, made sure that the frontend, Solr, and Nutch are in the same Docker network, and tried to make REST calls to the Solr and Nutch services from my frontend application: Solr worked, Nutch didn't. I also suspected that there might be something wrong with my computer, or with the Windows 10 that I'm using, so I tried all of the things above on an AWS server. No luck. Finally, I dug into the Docker world and tried a bunch of things like exposing ports, different types of networks, etc. That didn't work either. But since I'm not having an issue with the Solr API, I thought it would be right to ask the community about this issue. Here's my repo: https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml I asked this question on https://stackoverflow.com/questions/67949442/apache-nutch-doesnt-expose-its-api. Thanks for your help, Gokmen
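[For anyone else hitting this, a hedged docker-compose sketch of the fix described above. Service and port names follow the thread; the layout of the linked repo may differ.]

```yaml
# Illustrative fragment: publish port 8081 on the host and bind the
# REST server to 0.0.0.0 so it accepts connections from outside the
# container (binding to localhost keeps it container-local only).
services:
  nutch:
    container_name: nutch
    ports:
      - '8081:8081'
    command: '/root/nutch/bin/nutch startserver -port 8081 -host 0.0.0.0'
```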
About Nutch 1.x Rest API at port 8081
Hello all, My last question turned out to be actually a Docker issue, which I understood with great embarrassment. This time, I'm pretty sure it's a similar issue :) But believe me, I put a lot of effort into this one as well. This documentation (https://cwiki.apache.org/confluence/display/NUTCH/Nutch+1.X+RESTAPI) tells me that after I run the server with `bin/nutch startserver` the API starts working on port 8081. Within the Docker container (which I enter with docker exec -it) I can confirm that the API is up and running and responsive to my REST calls, but I cannot reach it from outside of the container. So far I have tried visiting http://localhost:8081/admin in my browser (I have also tried other variations like 127.0.0.1, 0.0.0.0, the Docker IP, the machine IP, etc.), and Postman with all of these URLs as well. Additionally, I added a frontend application to my Docker network, made sure that the frontend, Solr, and Nutch are in the same Docker network, and tried to make REST calls to the Solr and Nutch services from my frontend application: Solr worked, Nutch didn't. I also suspected that there might be something wrong with my computer, or with the Windows 10 that I'm using, so I tried all of the things above on an AWS server. No luck. Finally, I dug into the Docker world and tried a bunch of things like exposing ports, different types of networks, etc. That didn't work either. But since I'm not having an issue with the Solr API, I thought it would be right to ask the community about this issue. Here's my repo: https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml I asked this question on https://stackoverflow.com/questions/67949442/apache-nutch-doesnt-expose-its-api. Thanks for your help, Gokmen
Re: Apache Nutch help request for a school project :)
:) On Thu, Jun 10, 2021 at 7:18 AM gokmen.yontem wrote: > Lewis, Sebastian > I can’t thank you enough! Your help is much appreciated. > > Next time I'll follow your advice and use the mailing list, which I > wasn't aware of that. > > Best wishes, > Gorkem > > > On 2021-06-07 20:08, lewis john mcgibbney wrote: > > Yep Sebastian is absolutely correct. I sent you a pull request. > > > > https://github.com/gorkemyontem/nutch/pull/1 > > HTH > > lewismc > > > > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney > > wrote: > > > >> I’ll have a look today. You can always use the mailing list as > >> well. Feel free to post your questions there and we will help you > >> out :) > >> > >> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem > >> wrote: > >> > >>> Hi Lewis, > >>> Sorry to bother you. I've been trying to configure Apache Nutch > >>> for > >>> almost 10 days now and I'm about to give up. I saw that you are > >>> contributing to this project and I thought maybe you can help me. > >>> This is how desperate I am :) > >>> > >>> Here's my repo if you have time: > >>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml > >>> I'm trying to use docker images so there isn't much on the repo/ > >>> > >>> This is my current error: > >>> > >>> nutch| Indexer: java.lang.RuntimeException: Indexing job did > >>> not > >>> succeed, job status:FAILED, reason: NA > >>> nutch| at > >>> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150) > >>> nutch| at > >>> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291) > >>> nutch| at > >>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > >>> nutch| at > >>> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300) > >>> > >>> People say that schema.xml could be wrong, but I'm using the most > >>> up to > >>> date one from here > >>> > >> > > > https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml > >>> > >>> Many many thanks! 
> >>> Best wishes, > >>> Gorkem > >> -- > >> > >> http://home.apache.org/~lewismc/ > >> http://people.apache.org/keys/committer/lewismc > > > > -- > > > > http://home.apache.org/~lewismc/ > > http://people.apache.org/keys/committer/lewismc > -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
Re: Apache Nutch help request for a school project :)
Yep Sebastian is absolutely correct. I sent you a pull request. https://github.com/gorkemyontem/nutch/pull/1 HTH lewismc On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney wrote: > I’ll have a look today. You can always use the mailing list as well. Feel > free to post your questions there and we will help you out :) > > On Sun, Jun 6, 2021 at 12:43 gokmen.yontem > wrote: > >> Hi Lewis, >> Sorry to bother you. I've been trying to configure Apache Nutch for >> almost 10 days now and I'm about to give up. I saw that you are >> contributing to this project and I thought maybe you can help me. >> This is how desperate I am :) >> >> Here's my repo if you have time: >> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml >> I'm trying to use docker images so there isn't much on the repo/ >> >> This is my current error: >> >> nutch| Indexer: java.lang.RuntimeException: Indexing job did not >> succeed, job status:FAILED, reason: NA >> nutch| at >> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150) >> nutch| at >> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291) >> nutch| at >> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) >> nutch| at >> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300) >> >> >> People say that schema.xml could be wrong, but I'm using the most up to >> date one from here >> >> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml >> >> >> Many many thanks! >> Best wishes, >> Gorkem >> > -- > http://home.apache.org/~lewismc/ > http://people.apache.org/keys/committer/lewismc > -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
Re: Apache Nutch help request for a school project :)
Hi Gorkem, I haven't verified it by trying - but it may be that, given your configuration, the Solr instance isn't reachable via http://localhost:8983/solr/nutch. Inside the Docker network, host names are the same as container names, that is, http://solr:8983/solr/nutch might work. Cf. the docker-compose networking documentation: https://docs.docker.com/compose/networking/

In your docker-compose.yaml there is:

services:
  solr:
    container_name: solr
    image: 'solr:8.5.2'
    ports:
      - '8983:8983'
    ...
  nutch:
    container_name: nutch
    ...
    command: '/root/nutch/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch -s urls crawl 1'

Please try to fix the host name in the Solr URL (use the container name solr instead of localhost).

Important: you need to configure the Solr URL in the file conf/index-writers.xml unless you're using Nutch 1.14 or below. See https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial#NutchTutorial-SetupSolrforsearch

In any case, it's important to be able to read the logs (stdout/stderr and the hadoop.log)! I know this isn't trivial when using docker-compose, but it will save you a lot of time when searching for errors. If you need help here, please let us know. Best start a separate thread on the Nutch user mailing list.

Best, Sebastian

On 6/7/21 3:18 PM, lewis john mcgibbney wrote: I’ll have a look today. You can always use the mailing list as well. Feel free to post your questions there and we will help you out :) On Sun, Jun 6, 2021 at 12:43 gokmen.yontem wrote: Hi Lewis, Sorry to bother you. I've been trying to configure Apache Nutch for almost 10 days now and I'm about to give up. I saw that you are contributing to this project and I thought maybe you can help me.
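[A hedged sketch of the corresponding conf/index-writers.xml fragment. The writer id and parameter names follow the format shipped with Nutch 1.15+; verify them against the file in your own distribution.]

```xml
<!-- Illustrative: point the Solr writer at the container name "solr",
     which Docker's network DNS resolves, instead of localhost. -->
<writer id="indexer_solr_1"
        class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="http"/>
    <param name="url" value="http://solr:8983/solr/nutch"/>
    <param name="commitSize" value="1000"/>
  </parameters>
</writer>
```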
This is how desperate I am :)

Here's my repo if you have time: https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
I'm trying to use docker images so there isn't much on the repo.

This is my current error:

nutch| Indexer: java.lang.RuntimeException: Indexing job did not succeed, job status:FAILED, reason: NA
nutch|     at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
nutch|     at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
nutch|     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
nutch|     at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)

People say that schema.xml could be wrong, but I'm using the most up-to-date one from here: https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml

Many many thanks! Best wishes, Gorkem
Re: Apache Nutch help request for a school project :)
I’ll have a look today. You can always use the mailing list as well. Feel free to post your questions there and we will help you out :) On Sun, Jun 6, 2021 at 12:43 gokmen.yontem wrote: > Hi Lewis, > Sorry to bother you. I've been trying to configure Apache Nutch for > almost 10 days now and I'm about to give up. I saw that you are > contributing to this project and I thought maybe you can help me. > This is how desperate I am :) > > Here's my repo if you have time: > https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml > I'm trying to use docker images so there isn't much on the repo/ > > This is my current error: > > nutch| Indexer: java.lang.RuntimeException: Indexing job did not > succeed, job status:FAILED, reason: NA > nutch| at > org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150) > nutch| at > org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291) > nutch| at > org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76) > nutch| at > org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300) > > > People say that schema.xml could be wrong, but I'm using the most up to > date one from here > > https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml > > > Many many thanks! > Best wishes, > Gorkem > -- http://home.apache.org/~lewismc/ http://people.apache.org/keys/committer/lewismc
Re: Recommendation for free and production-ready Hadoop setup to run Nutch
Hi Lewis, hi Markus,

> snappy compression, which is a massive improvement for large data shuffling jobs

Yes, I can confirm this. Also: it's worth considering zstd for all data kept for longer. We use it for a 25-billion CrawlDB: it's almost as fast (both compression and decompression) as snappy, and you get a compression ratio not far from bzip2 (which is very slow).

> worker/task nodes were run on spot instances

We do the same. However, EMR is priced at 25% of the on-demand EC2 instance price. As spot prices are usually 50-70% of the on-demand price, the EMR costs would add a non-trivial part.

> backup logic

Yes. We checkpoint the output of every step to S3 unless it runs in less than one hour.

> using Terraform or AWS CloudFormation

We use a shell script to bootstrap the instances. Some parts have been added/rewritten using CloudFormation (the VPC setup). The long-term plan is to use more templates and also to bake a base machine image to speed up the bootstrapping.

> ARM support

ARM instances (including spot instances) often offer a better price/CPU ratio. We've already switched to ARM for a couple of services/tasks - the effort is minimal: choose the right base image, and installing and running Java or Python workflows does not change. But I haven't tried Hadoop yet.

Thanks for sharing your experiences, I'll keep you updated about our decisions and progress!

Best, Sebastian
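[To make the codec split concrete, a hedged mapred-site.xml sketch of the scheme described above: snappy for intermediate shuffle data, zstd for outputs kept long-term. The property keys are the standard Hadoop 3.x names; ZStandardCodec requires Hadoop built with zstd support, so verify both against your distribution.]

```xml
<!-- snappy for intermediate (shuffle) data: fastest, favors large shuffles -->
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.map.output.compress.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
<!-- zstd for job output kept long-term: near-snappy speed, far better ratio -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.ZStandardCodec</value>
</property>
```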
Re: Recommendation for free and production-ready Hadoop setup to run Nutch
Does the Apache Bigtop project not meet the requirements of a free distribution? https://github.com/apache/bigtop What is the status of that project?