[ANNOUNCE] Apache Nutch 1.20 Release

2024-04-28 Thread lewis john mcgibbney
The Apache Nutch 1.20 release is available for download from the Apache Nutch
project site: https://nutch.apache.org/download/

Please verify signatures using the KEYS file
https://raw.githubusercontent.com/apache/nutch/master/KEYS when downloading
the release.
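
For example, verification could look roughly like this (a sketch; the file
names assume the 1.20 source tarball):

  # import the Nutch committer keys
  curl -sL https://raw.githubusercontent.com/apache/nutch/master/KEYS | gpg --import
  # verify the detached GPG signature
  gpg --verify apache-nutch-1.20-src.tar.gz.asc apache-nutch-1.20-src.tar.gz
  # compute the checksum and compare it against the published .sha512 file
  sha512sum apache-nutch-1.20-src.tar.gz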

This release includes more than 60 bug fixes and improvements; the full
list of changes can be seen in the Jira release report
https://s.apache.org/ovjf3

Thanks to everyone who contributed to this release!

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[RESULT] WAS Re: [VOTE] Apache Nutch 1.20 Release

2024-04-24 Thread lewis john mcgibbney
Hi user@ & dev@,
I’m glad to conclude the Nutch 1.20 release candidate VOTE thread with the
following results.

[5] +1 Release this package as Apache Nutch 1.20
snagel*
balakuntala*
blackice*
Joe Gilvary
lewismc*

[ ] -1 Do not release this package because…

*Nutch Project Management Committee-binding

The Nutch 1.20 release candidate has passed the community VOTE. I will
therefore promote this release candidate.

Thanks for voting, and thanks to everyone who contributed to the Apache Nutch
1.20 release.

lewismc

On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.20 release is available at [0] where
> accompanying SHA512 and ASC signatures can also be found.
> Information on verifying releases can be found at [1].
>
> The release candidate comprises a .zip and tar.gz archive of the sources
> at [2] and complementary binary distributions. In addition, a staged maven
> repository is available at [3].
>
> The Nutch 1.20 release report is available at [4].
>
> Please vote on releasing this package as Apache Nutch 1.20. The vote is
> open for at least the next 72 hours and passes if a majority of at least
> three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.20.
>
> [ ] -1 Do not release this package because…
>
> Cheers,
> lewismc
> P.S. Here is my +1.
>
> [0] https://dist.apache.org/repos/dist/dev/nutch/1.20
> [1] http://nutch.apache.org/downloads.html#verify
> [2] https://github.com/apache/nutch/tree/release-1.20
> [3]
> https://repository.apache.org/content/repositories/orgapachenutch-1021/
> [4] https://s.apache.org/ovjf3
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: [VOTE] Apache Nutch 1.20 Release

2024-04-16 Thread lewis john mcgibbney
Hi user@, dev@,
Please consider reviewing the Nutch 1.20 release candidate. This is a
critical prerequisite for us to make releases of software at the ASF.
Thank you
lewismc

On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.20 release is available at [0] where
> accompanying SHA512 and ASC signatures can also be found.
> Information on verifying releases can be found at [1].
>
> The release candidate comprises a .zip and tar.gz archive of the sources
> at [2] and complementary binary distributions. In addition, a staged maven
> repository is available at [3].
>
> The Nutch 1.20 release report is available at [4].
>
> Please vote on releasing this package as Apache Nutch 1.20. The vote is
> open for at least the next 72 hours and passes if a majority of at least
> three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.20.
>
> [ ] -1 Do not release this package because…
>
> Cheers,
> lewismc
> P.S. Here is my +1.
>
> [0] https://dist.apache.org/repos/dist/dev/nutch/1.20
> [1] http://nutch.apache.org/downloads.html#verify
> [2] https://github.com/apache/nutch/tree/release-1.20
> [3]
> https://repository.apache.org/content/repositories/orgapachenutch-1021/
> [4] https://s.apache.org/ovjf3
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: [VOTE] Apache Nutch 1.20 Release

2024-04-11 Thread Sebastian Nagel

Hi Lewis,

here's my +1

 * signatures of release packages are valid
 * build from the source package successful, unit tests pass
   (a command sketch follows below)
 * tested a few Nutch tools in the binary package (local mode)
 * ran a sample crawl and tested many Nutch tools on a single-node cluster
   running Hadoop 3.4.0, see
   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/
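
For reference, a minimal sketch of the source build and test steps, assuming
the standard Ant build on JDK 11:

  tar xzf apache-nutch-1.20-src.tar.gz
  cd apache-nutch-1.20
  ant runtime   # builds runtime/local and runtime/deploy
  ant test      # runs the unit tests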

One note about the CHANGES.md: it's now a mixture of HTML and plain text.
It does not use the potential of Markdown, e.g. sections/headlines for the
releases that would make the change log navigable via a table of contents.
The embedded HTML makes it less readable when viewed in a text editor.
The rendering on GitHub [5] is acceptable, with only minor glitches,
mostly the placement of multiple lines in a single paragraph:
  https://github.com/apache/nutch/blob/branch-1.20/CHANGES.md
We also have a change log on Jira:
  https://s.apache.org/ovjf3
That's why I wouldn't call the CHANGES.md a "blocker". We should update
the formatting after the release to make it easily readable as source again
and improve the document structure using the Markdown markup, e.g. as
sketched below.
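
For illustration, a hypothetical fragment of such a Markdown structure
(the issue ids are placeholders):

  # Apache Nutch Change Log

  ## Nutch 1.20 (2024-04-28)

  ### Bug Fixes
  - NUTCH-XXXX: example entry (placeholder)

  ### Improvements
  - NUTCH-XXXX: example entry (placeholder)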

~Sebastian

On 4/9/24 23:28, lewis john mcgibbney wrote:

Hi Folks,

A first candidate for the Nutch 1.20 release is available at [0] where 
accompanying SHA512 and ASC signatures can also be found.

Information on verifying releases can be found at [1].

The release candidate comprises a .zip and tar.gz archive of the sources at [2] 
and complementary binary distributions. In addition, a staged maven repository 
is available at [3].


The Nutch 1.20 release report is available at [4].

Please vote on releasing this package as Apache Nutch 1.20. The vote is open for 
at least the next 72 hours and passes if a majority of at least three +1 Nutch 
PMC votes are cast.


[ ] +1 Release this package as Apache Nutch 1.20.

[ ] -1 Do not release this package because…

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.20
[1] http://nutch.apache.org/downloads.html#verify
[2] https://github.com/apache/nutch/tree/release-1.20
[3] https://repository.apache.org/content/repositories/orgapachenutch-1021/
[4] https://s.apache.org/ovjf3

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[VOTE] Apache Nutch 1.20 Release

2024-04-09 Thread lewis john mcgibbney
Hi Folks,

A first candidate for the Nutch 1.20 release is available at [0] where
accompanying SHA512 and ASC signatures can also be found.
Information on verifying releases can be found at [1].

The release candidate comprises a .zip and tar.gz archive of the sources at
[2] and complementary binary distributions. In addition, a staged maven
repository is available at [3].

The Nutch 1.20 release report is available at [4].

Please vote on releasing this package as Apache Nutch 1.20. The vote is
open for at least the next 72 hours and passes if a majority of at least
three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.20.

[ ] -1 Do not release this package because…

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.20
[1] http://nutch.apache.org/downloads.html#verify
[2] https://github.com/apache/nutch/tree/release-1.20
[3] https://repository.apache.org/content/repositories/orgapachenutch-1021/
[4] https://s.apache.org/ovjf3

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[GSoC 2024 PROPOSAL] Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread lewis john mcgibbney
Hi user@ & dev@,

I decided to write up a GSoC’24 proposal and encourage interested
applicants to register your interest in the JIRA issue or else reach
out to the Nutch PMC over on d...@nutch.apache.org (please CC
lewi...@apache.org).

Title: Overhaul the legacy Nutch plugin framework and replace it with PF4J
JIRA: https://issues.apache.org/jira/browse/NUTCH-3034

Thanks in advance, and good luck to prospective GSoC applicants.

lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: nutch adds %20 in urls instead of spaces

2024-01-10 Thread Steve Cohen
Thanks for the response, Markus. Disabling urlnormalizer-basic works.

On Tue, Jan 9, 2024 at 3:43 PM Markus Jelsma 
wrote:

> Hello Steve,
>
> Having those spaces normalized/encoded is expected behaviour with
> urlnormalizer-basic active. I would recommend to keep it this way and have
> all URLs in Solr properly encoded. Having spaces in Solr IDs is also not
> recommended as it can lead to unexpected behaviour.
>
> If you really don't want them encoded, disable urlnormalizer-basic in your
> configuration.
>
> Regards,
> Markus
>
On Tue, Jan 9, 2024 at 19:20, Steve Cohen wrote:
>
> > Hello,
> >
> > I am updating a nutch crawl that read files in directories that have
> > spaces. The urls show %20 instead of spaces. This doesn't seem to be what
> > the behavior was in the past.
> >
> > In nutch 1.10 I get these results
> >
> > Nutch 1.10
> >
> >
> >
> > ParseData::
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /nycor/10-15-2018 and on - Scanned
> > Outlinks: 4
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor:
> > 2018/
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor:
> > 2019/
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor:
> > 2022/
> >   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date
> > Unknown/ anchor: Shipment Date Unknown/
> >
> > in Nutch 1.19, I get this
> >
> >
> > ParseData::
> > Version: 5
> > Status: success(1,0)
> > Title: Index of /nycor/10-15-2018 and on - Scanned
> > Outlinks: 4
> >   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/
> > anchor: 2018/
> >   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/
> > anchor: 2019/
> >   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/
> > anchor: 2022/
> >   outlink: toUrl:
> >
> file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/
> > anchor: Shipment Date Unknown/
> >
> > We are uploading to solr and the links aren't right with the %20s in the
> > url. How do I remove the %20s?
> >
> > Thanks,
> > Steve Cohen
> >
>


Re: nutch adds %20 in urls instead of spaces

2024-01-09 Thread Markus Jelsma
Hello Steve,

Having those spaces normalized/encoded is expected behaviour with
urlnormalizer-basic active. I would recommend to keep it this way and have
all URLs in Solr properly encoded. Having spaces in Solr IDs is also not
recommended as it can lead to unexpected behaviour.

If you really don't want them encoded, disable urlnormalizer-basic in your
configuration.
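
For example, override plugin.includes in conf/nutch-site.xml with
urlnormalizer-basic left out (a sketch; start from the plugin.includes value
in your own nutch-default.xml):

  <property>
    <name>plugin.includes</name>
    <!-- default plugin list minus urlnormalizer-basic (sketch) -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex)</value>
  </property>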

Regards,
Markus

On Tue, Jan 9, 2024 at 19:20, Steve Cohen wrote:

> Hello,
>
> I am updating a nutch crawl that read files in directories that have
> spaces. The urls show %20 instead of spaces. This doesn't seem to be what
> the behavior was in the past.
>
> In nutch 1.10 I get these results
>
> Nutch 1.10
>
>
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor:
> 2018/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor:
> 2019/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor:
> 2022/
>   outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date
> Unknown/ anchor: Shipment Date Unknown/
>
> in Nutch 1.19, I get this
>
>
> ParseData::
> Version: 5
> Status: success(1,0)
> Title: Index of /nycor/10-15-2018 and on - Scanned
> Outlinks: 4
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/
> anchor: 2018/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/
> anchor: 2019/
>   outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/
> anchor: 2022/
>   outlink: toUrl:
> file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/
> anchor: Shipment Date Unknown/
>
> We are uploading to solr and the links aren't right with the %20s in the
> url. How do I remove the %20s?
>
> Thanks,
> Steve Cohen
>


Re: nutch adds %20 in urls instead of spaces

2024-01-09 Thread Jim Anderson
unsubscribe



nutch adds %20 in urls instead of spaces

2024-01-09 Thread Steve Cohen
Hello,

I am updating a nutch crawl that read files in directories that have
spaces. The urls show %20 instead of spaces. This doesn't seem to be what
the behavior was in the past.

In nutch 1.10 I get these results

Nutch 1.10



ParseData::
Version: 5
Status: success(1,0)
Title: Index of /nycor/10-15-2018 and on - Scanned
Outlinks: 4
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2018/ anchor:
2018/
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2019/ anchor:
2019/
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/2022/ anchor:
2022/
  outlink: toUrl: file:/nycor/10-15-2018 and on - Scanned/Shipment Date
Unknown/ anchor: Shipment Date Unknown/

in Nutch 1.19, I get this


ParseData::
Version: 5
Status: success(1,0)
Title: Index of /nycor/10-15-2018 and on - Scanned
Outlinks: 4
  outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2018/
anchor: 2018/
  outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2019/
anchor: 2019/
  outlink: toUrl: file:/nycor/10-15-2018%20and%20on%20-%20Scanned/2022/
anchor: 2022/
  outlink: toUrl:
file:/nycor/10-15-2018%20and%20on%20-%20Scanned/Shipment%20Date%20Unknown/
anchor: Shipment Date Unknown/

We are uploading to solr and the links aren't right with the %20s in the
url. How do I remove the %20s?

Thanks,
Steve Cohen


Re: Nutch - Restriction by content type

2023-11-16 Thread Markus Jelsma
Hello,

You can skip certain types of documents based on their file extension,
using the urlfilter-suffix plugin. It only filters known suffixes. Filtering
by content type at fetch time is not possible, because determining the
content type requires fetching (and parsing) the documents.

You can skip specific content types when indexing using the Jexl indexing
filter.
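
For the suffix filter, the suffixes to skip are listed in
conf/suffix-urlfilter.txt, roughly like this (a sketch; urlfilter-suffix must
also be enabled in plugin.includes):

  # suffix-urlfilter.txt (sketch): URLs ending in these suffixes are filtered out
  .pdf
  .zip
  .exe
  .jpg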

Regards,
Markus

On Thu, Nov 16, 2023 at 14:56, Raj Chidara wrote:

> Hello
>   Can we control crawling of web pages by their content type through any
> configuration setting?  For example, I want to crawl only pages whose
> content type is text/html from a website and do not want to crawl other
> pages/files.
>
>
>
> Thanks and Regards
>
> Raj Chidara


Nutch - Restriction by content type

2023-11-16 Thread Raj Chidara
Hello
  Can we control crawling of web pages by their content type through any
configuration setting?  For example, I want to crawl only pages whose content
type is text/html from a website and do not want to crawl other pages/files.



Thanks and Regards

Raj Chidara


 
 
 


Re: [DISCUSS] Removing Any23 from Nutch?

2023-09-14 Thread lewis john mcgibbney
+1 Tim.


On Wed, Sep 13, 2023 at 16:50 

>
>
>
> -- Forwarded message --
> From: Tim Allison 
> To: user@nutch.apache.org, d...@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 13 Sep 2023 10:50:08 -0400
> Subject: [DISCUSS] Removing Any23 from Nutch?
> All,
>   I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks
> ago.  Any23 was moved to the attic in June. Unless there are objections, I
> propose removing it from Nutch before the next release.
>   Any objections?
>
>Best,
>
>Tim
>


[DISCUSS] Removing Any23 from Nutch?

2023-09-13 Thread Tim Allison
All,
  I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks
ago.  Any23 was moved to the attic in June. Unless there are objections, I
propose removing it from Nutch before the next release.
  Any objections?

   Best,

   Tim


Re: Nutch Exception

2023-07-24 Thread Markus Jelsma
Hello,

Please check the logs for more information.
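
In local mode the underlying stack trace usually ends up in logs/hadoop.log
under the runtime directory, e.g. (a sketch assuming the default layout):

  tail -n 100 runtime/local/logs/hadoop.log
  # or search around the failed job id
  grep -B 5 -A 20 "job_local952809651_0001" runtime/local/logs/hadoop.log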

Regards,
Markus

On Mon, Jul 24, 2023 at 19:05, Raj Chidara wrote:

> Hi
>
>   Nutch 1.19 compiled with ant without any errors and when running
> Injector, getting an error that
>
>
>
> 19:20:25.055 [main] ERROR org.apache.nutch.crawl.Injector - Injector job
> did not succeed, job id: job_local952809651_0001, job status: FAILED,
> reason: NA
>
> Exception in thread "main" java.lang.RuntimeException: Injector job did
> not succeed, job id: job_local952809651_0001, job status: FAILED, reason: NA
>
> at org.apache.nutch.crawl.Injector.inject(Injector.java:442)
>
> at org.apache.nutch.crawl.Injector.inject(Injector.java:365)
>
> at org.apache.nutch.crawl.Injector.inject(Injector.java:360)
>
> at org.apache.nutch.crawl.Crawl.run(Crawl.java:249)
>
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)
>
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:146)
>
>
>
>
>
> Thanks and Regards
>
> Raj Chidara


Nutch Exception

2023-07-24 Thread Raj Chidara
Hi 

  Nutch 1.19 compiled with ant without any errors and when running Injector, 
getting an error that



19:20:25.055 [main] ERROR org.apache.nutch.crawl.Injector - Injector job did 
not succeed, job id: job_local952809651_0001, job status: FAILED, reason: NA

Exception in thread "main" java.lang.RuntimeException: Injector job did not 
succeed, job id: job_local952809651_0001, job status: FAILED, reason: NA

    at org.apache.nutch.crawl.Injector.inject(Injector.java:442)

    at org.apache.nutch.crawl.Injector.inject(Injector.java:365)

    at org.apache.nutch.crawl.Injector.inject(Injector.java:360)

    at org.apache.nutch.crawl.Crawl.run(Crawl.java:249)

    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:81)

    at org.apache.nutch.crawl.Crawl.main(Crawl.java:146)





Thanks and Regards

Raj Chidara

Nutch 1.19 in eclipse

2023-07-22 Thread Raj Chidara

HI

I am following the instructions given here to run Nutch 1.19 in Eclipse
(2022-03) with Java 11.0.19:


https://cwiki.apache.org/confluence/display/NUTCH/RunNutchInEclipse#RunNutchInEclipse-Beforeyoustart

The created project is giving the build error "The package org.w3c.dom is
accessible from more than one module: <unnamed>, java.xml" and I am not able
to continue with the next steps. Please help me with this problem.



Thanks and Regards

Raj Chidara




Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Tim Allison
Thank you, all!  I’m thrilled to join the team!

On Thu, Jul 20, 2023 at 9:42 AM Julien Nioche 
wrote:

> What a fantastic addition to the Nutch team! Congrats to Tim
>
> On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel  wrote:
>
>> Dear all,
>>
>> It is my pleasure to announce that Tim Allison has joined us
>> as a committer and member of the Nutch PMC.
>>
>> You may already know Tim as a maintainer of and contributor to
>> Apache Tika. So, it was great to see contributions to the
>> Nutch source code from an experienced developer who is also
>> active in a related Apache project. Among other contributions
>> Tim recently implemented the indexer-opensearch plugin.
>>
>> Thank you, Tim Allison, and congratulations on your new role
>> in the Apache Nutch community! And welcome on board!
>>
>> Sebastian
>> (on behalf of the Nutch PMC)
>
>
>>
>
> --
>
> *Open Source Solutions for Text Engineering*
>
> http://www.digitalpebble.com
> http://digitalpebble.blogspot.com/
> #digitalpebble <http://twitter.com/digitalpebble>
>


Re: [ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Julien Nioche
What a fantastic addition to the Nutch team! Congrats to Tim

On Thu, 20 Jul 2023 at 10:20, Sebastian Nagel  wrote:

> Dear all,
>
> It is my pleasure to announce that Tim Allison has joined us
> as a committer and member of the Nutch PMC.
>
> You may already know Tim as a maintainer of and contributor to
> Apache Tika. So, it was great to see contributions to the
> Nutch source code from an experienced developer who is also
> active in a related Apache project. Among other contributions
> Tim recently implemented the indexer-opensearch plugin.
>
> Thank you, Tim Allison, and congratulations on your new role
> in the Apache Nutch community! And welcome on board!
>
> Sebastian
> (on behalf of the Nutch PMC)
>


-- 

*Open Source Solutions for Text Engineering*

http://www.digitalpebble.com
http://digitalpebble.blogspot.com/
#digitalpebble <http://twitter.com/digitalpebble>


[ANNOUNCE] New Nutch committer and PMC - Tim Allison

2023-07-20 Thread Sebastian Nagel

Dear all,

It is my pleasure to announce that Tim Allison has joined us
as a committer and member of the Nutch PMC.

You may already know Tim as a maintainer of and contributor to
Apache Tika. So, it was great to see contributions to the
Nutch source code from an experienced developer who is also
active in a related Apache project. Among other contributions
Tim recently implemented the indexer-opensearch plugin.

Thank you, Tim Allison, and congratulations on your new role
in the Apache Nutch community! And welcome on board!

Sebastian
(on behalf of the Nutch PMC)


Re: Nutch 1.19 Getting Error: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'

2023-05-15 Thread Sebastian Nagel

Hi Eric,

unfortunately, on Windows you also need to download and install winutils.exe and 
hadoop.dll,

see
  https://github.com/cdarlint/winutils and

https://stackoverflow.com/questions/41851066/exception-in-thread-main-java-lang-unsatisfiedlinkerror-org-apache-hadoop-io

The installation of Hadoop is not mandatory - the Nutch binary package
already includes Hadoop jar files.

Alternatively, you may prefer to run Nutch on Linux - no additional 
installations required.
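
The usual Windows setup looks roughly like this (a sketch for cmd.exe;
C:\hadoop is a hypothetical path, and winutils.exe/hadoop.dll must match
your Hadoop version):

  set HADOOP_HOME=C:\hadoop
  set PATH=%HADOOP_HOME%\bin;%PATH%
  rem place winutils.exe and hadoop.dll into %HADOOP_HOME%\bin
  rem some setups additionally need hadoop.dll copied to C:\Windows\System32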


Best,
Sebastian

On 5/15/23 04:07, Eric Valencia wrote:

Hello everyone,

So, I set up Nutch 1.19, Solr 8.11.2, and hadoop 3.3.5, to the best of my
knowledge.

After, I went into the nutch directory and ran this command:
*bin/nutch generate crawl/crawldb crawl/segments*

Then, I got an error:
*Exception in thread "main" java.lang.UnsatisfiedLinkError: 'boolean
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String,
int)'*

Does anyone know how to solve this problem?


Nutch 1.19 Getting Error: 'boolean org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String, int)'

2023-05-14 Thread Eric Valencia
Hello everyone,

So, I set up Nutch 1.19, Solr 8.11.2, and hadoop 3.3.5, to the best of my
knowledge.

After, I went into the nutch directory and ran this command:
*bin/nutch generate crawl/crawldb crawl/segments*

Then, I got an error:
*Exception in thread "main" java.lang.UnsatisfiedLinkError: 'boolean
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String,
int)'*

Does anyone know how to solve this problem?

Below is the full output:
$ bin/nutch generate crawl/crawldb crawl/segments
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in
[jar:file:/C:/Users/User/Desktop/wiki/a/ApacheNutch/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type
[org.apache.logging.slf4j.Log4jLoggerFactory]
2023-05-14 19:01:16,433 INFO o.a.n.p.PluginManifestParser [main] Plugins: looking in: C:\Users\User\Desktop\wiki\a\ApacheNutch\apache-nutch-1.19\plugins
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Plugin Auto-activation mode: [true]
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main] Registered Plugins:
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]         Regex URL Filter (urlfilter-regex)
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]         Html Parse Plug-in (parse-html)
2023-05-14 19:01:16,558 INFO o.a.n.p.PluginRepository [main]         HTTP Framework (lib-http)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         the nutch core extension points (nutch-extensionpoints)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Basic Indexing Filter (index-basic)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Anchor Indexing Filter (index-anchor)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Tika Parser Plug-in (parse-tika)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Basic URL Normalizer (urlnormalizer-basic)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Regex URL Filter Framework (lib-regex-filter)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Regex URL Normalizer (urlnormalizer-regex)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         URL Validator (urlfilter-validator)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         CyberNeko HTML Parser (lib-nekohtml)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         OPIC Scoring Plug-in (scoring-opic)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Pass-through URL Normalizer (urlnormalizer-pass)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         Http Protocol Plug-in (protocol-http)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         SolrIndexWriter (indexer-solr)
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main] Registered Extension-Points:
2023-05-14 19:01:16,559 INFO o.a.n.p.PluginRepository [main]         (Nutch Content Parser)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch URL Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (HTML Parse Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch Scoring)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch URL Normalizer)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch Publisher)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch Exchange)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch Protocol)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch URL Ignore Exemption Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch Index Writer)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch Segment Merge Filter)
2023-05-14 19:01:16,560 INFO o.a.n.p.PluginRepository [main]         (Nutch Indexing Filter)
2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: starting at 2023-05-14 19:01:16
2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: Selecting best-scoring urls due for fetch.
2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: filtering: true
2023-05-14 19:01:16,969 INFO o.a.n.c.Generator [main] Generator: normalizing: true
2023-05-14 19:01:16,974 INFO o.a.n.c.Generator [main] Generator: running in local mode, generating exactly one partition.
Exception in thread "main" java.lang.UnsatisfiedLinkError: 'boolean
org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(java.lang.String,
int)'
at org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Native
Method)
at
org.apache.hadoop.io.nativeio.NativeIO$Windows.acce

Re: Nutch 1.19/Hadoop compatible

2023-03-07 Thread Markus Jelsma
Hello Mike,

> Is nutch 1.19 compatible with Hadoop 3.3.4?
Yes!

Regards,
Markus

On Tue, Mar 7, 2023 at 17:37, Mike wrote:

> Hello!
>
> Is nutch 1.19 compatible with Hadoop 3.3.4?
>
>
> Thanks!
>
> mike
>


Nutch 1.19/Hadoop compatible

2023-03-07 Thread Mike
Hello!

Is nutch 1.19 compatible with Hadoop 3.3.4?


Thanks!

mike


Re: Configuration Nutch in cluster mode

2023-01-17 Thread Mike
Hello Sebastian!

I have now installed Hadoop; unfortunately there are problems.
I will make a separate post.

Thanks
Mike

On Tue, Jan 17, 2023 at 09:49, Sebastian Nagel wrote:

> Hi Mike,
>
> the Nutch configuration files are included in the job file found in
> runtime/deploy after build. This means you need to compile Nutch yourself
> if used in "distributed" mode.
>
> For exercising, you can first work in "pseudo-distributed" mode, i.e.
> on a single-node Hadoop cluster. All commands are the same as in fully
> distributed mode.
>
> If it helps, I prepared some setup scripts to run Nutch in
> pseudo-distributed mode:
>https://github.com/sebastian-nagel/nutch-test-single-node-cluster
>
> Best,
> Sebastian
>
> On 1/15/23 04:26, Mike wrote:
> > I will now try to configure the bot url etc. before the building,
> > but how and where do I configure between the crawls e.g. number of pages
> > per host?
> >
> > where do I configure nutch in cluster mode?
> >
> > thx, mike
> >
>


Re: Configuration Nutch in cluster mode

2023-01-17 Thread Sebastian Nagel

Hi Mike,

the Nutch configuration files are included in the job file found in
runtime/deploy after build. This means you need to compile Nutch yourself
if used in "distributed" mode.
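
So after any configuration change the job file has to be rebuilt, roughly
like this (a sketch, assuming a source checkout):

  # edit conf/nutch-site.xml, then rebuild
  ant runtime
  # the job file used in distributed mode:
  ls runtime/deploy/*.job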

For exercising, you can first work in "pseudo-distributed" mode, i.e.
on a single-node Hadoop cluster. All commands are the same as in fully
distributed mode.


If it helps, I prepared some setup scripts to run Nutch in pseudo-distributed 
mode:
  https://github.com/sebastian-nagel/nutch-test-single-node-cluster

Best,
Sebastian

On 1/15/23 04:26, Mike wrote:

I will now try to configure the bot URL etc. before building,
but how and where do I configure things between the crawls, e.g. the number
of pages per host?

where do I configure nutch in cluster mode?

thx, mike



Re: Nutch/Hadoop Cluster

2023-01-17 Thread Sebastian Nagel

Hi Mike,

> It can be tedious to set up for the first time, and there are many components.

In case you prefer Linux packages, I can recommend Apache Bigtop, see
   https://bigtop.apache.org/
and for the list of package repositories
   https://downloads.apache.org/bigtop/stable/repos/

~Sebastian

On 1/15/23 01:06, Markus Jelsma wrote:

Hello Mike,


Would it pay off for me to put a Hadoop cluster on top of the 3 servers?


Yes, for as many reasons as Hadoop exists for. It can be tedious to set up
for the first time, and there are many components. But at least you have
three servers, which is more or less required by ZooKeeper, which you will
also need.

Ideally you would have some additional VMs to run the controlling Hadoop
programs and perhaps the Hadoop client nodes on. The workers can run on
bare metal.


1.) a server would not be integrated directly into the crawl process as a

master.

What do you mean? Can you elaborate?


2.) can I run multiple crawl jobs on one server?


Yes! Just have separate instances of Nutch home dirs on your Hadoop client
nodes, each having their own configuration.

Regards,
Markus

On Sat, Jan 14, 2023 at 18:42, Mike wrote:


Hi!

I am now crawling the internet in local mode in parallel with up to 10
instances on 3 computers. Would it pay off for me to put a Hadoop cluster
on top of the 3 servers?

1.) a server would not be integrated directly into the crawl process as a
master.
2.) can I run multiple crawl jobs on one server?

Thanks





Configuration Nutch in cluster mode

2023-01-14 Thread Mike
I will now try to configure the bot URL etc. before building,
but how and where do I configure things between the crawls, e.g. the number
of pages per host?

where do I configure nutch in cluster mode?

thx, mike


Re: Nutch/Hadoop Cluster

2023-01-14 Thread Markus Jelsma
Hello Mike,

> Would it pay off for me to put a Hadoop cluster on top of the 3 servers?

Yes, for as many reasons as Hadoop exists for. It can be tedious to set up
for the first time, and there are many components. But at least you have
three servers, which is more or less required by ZooKeeper, which you will
also need.

Ideally you would have some additional VMs to run the controlling Hadoop
programs and perhaps the Hadoop client nodes on. The workers can run on
bare metal.

> 1.) a server would not be integrated directly into the crawl process as a
master.

What do you mean? Can you elaborate?

> 2.) can I run multiple crawl jobs on one server?

Yes! Just have separate instances of Nutch home dirs on your Hadoop client
nodes, each having their own configuration.

Regards,
Markus

On Sat, Jan 14, 2023 at 18:42, Mike wrote:

> Hi!
>
> I am now crawling the internet in local mode in parallel with up to 10
> instances on 3 computers. Would it pay off for me to put a Hadoop cluster
> on top of the 3 servers?
>
> 1.) a server would not be integrated directly into the crawl process as a
> master.
> 2.) can I run multiple crawl jobs on one server?
>
> Thanks
>


Nutch/Hadoop Cluster

2023-01-14 Thread Mike
Hi!

I am now crawling the internet in local mode in parallel with up to 10
instances on 3 computers. Would it pay off for me to put a Hadoop cluster
on top of the 3 servers?

1.) a server would not be integrated directly into the crawl process as a
master.
2.) can I run multiple crawl jobs on one server?

Thanks


Re: Nutch/Hadoop: Error (FreeGenerator job did not succeed)

2022-10-14 Thread Markus Jelsma
Hello,

You cannot just run Nutch's JAR like that on Hadoop, you need the large
.job file instead. If you build Nutch from source, you will get a
runtime/deploy directory. Upload its contents to a Hadoop client and run
Nutch commands using bin/nutch ... You will then automatically use the
large .job file that is on the same level as the bin directory.
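
Roughly like this (a sketch; bin/nutch freegen wraps
org.apache.nutch.tools.FreeGenerator, and the HDFS paths are hypothetical):

  ant runtime
  cd runtime/deploy
  # submits the job to the cluster using the .job file next to bin/
  bin/nutch freegen /crawl/urls/tranco-top350k-20221007.txt /crawl/segments/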

Application log files on Hadoop are scattered across the cluster. Select
individual mapper or reducer subtasks in the web UI, click deeper, and
inspect their logs. That is where the application logs are to be found.

Good luck!
Markus

On Fri, Oct 14, 2022 at 16:18, Mike wrote:

> Hi!
>
> I've been using Nutch for a while but I'm new to Hadoop. I've got a cluster
> with Hadoop 3.2.3 installed.
>
> Do I have to install Nutch on the Hadoop filesystem or can I run it
> "locally"? The clients don't need more from Nutch than the info on the
> master in the command line:
>
>   hadoop jar /home/debian/nutch40/lib/apache-nutch-1.19.jar \
>     org.apache.nutch.tools.FreeGenerator \
>     -conf /home/debian/nutch40/conf/nutch-default.xml \
>     -Dplugin.folder=/home/debian/nutch40/plugins/ \
>     /crawl/urls//tranco-top350k-20221007.txt /home/debian/crawl/segments/
>
> I get an error on the command:
>
> Exception in thread "main" java.lang.RuntimeException: FreeGenerator job
> did not succeed, job id: job_1665751705815_0007, job status: FAILED,
> reason: Task failed task_1665751705815_0007_m_00
>
>
> Since I'm new, I can't properly find the logs in Hadoop yet.
>
> Is there a guide on how to install Nutch (1.19) on Hadoop that I haven't
> found?
>
> Thanks
> Mike
>


Nutch/Hadoop: Error (FreeGenerator job did not succeed)

2022-10-14 Thread Mike
Hi!

I've been using Nutch for a while but I'm new to Hadoop. I've got a cluster
with Hadoop 3.2.3 installed.

Do I have to install Nutch on the Hadoop filesystem or can I run it
"locally"? The clients don't need more from Nutch than the info on the
master in the command line:

  hadoop jar /home/debian/nutch40/lib/apache-nutch-1.19.jar \
    org.apache.nutch.tools.FreeGenerator \
    -conf /home/debian/nutch40/conf/nutch-default.xml \
    -Dplugin.folder=/home/debian/nutch40/plugins/ \
    /crawl/urls//tranco-top350k-20221007.txt /home/debian/crawl/segments/

I get an error on the command:

Exception in thread "main" java.lang.RuntimeException: FreeGenerator job
did not succeed, job id: job_1665751705815_0007, job status: FAILED,
reason: Task failed task_1665751705815_0007_m_00


Since I'm new, I can't properly find the logs in Hadoop yet.

Is there a guide on how to install Nutch (1.19) on Hadoop that I haven't found?

Thanks
Mike


[ANNOUNCE] Apache Nutch 1.19 Release

2022-09-08 Thread Sebastian Nagel
The Apache Nutch team is pleased to announce the release of
Apache Nutch v1.19.

Nutch is a well-matured, production-ready web crawler. Nutch 1.x enables
fine-grained configuration, relying on Apache Hadoop™ data structures.

Source and binary distributions are available for download from the
Apache Nutch download site:
   https://nutch.apache.org/downloads.html

Please verify signatures using the KEYS file available at the above
location when downloading the release.

This release includes more than 80 bug fixes and improvements; the full
list of changes can be seen in the release report
  https://s.apache.org/lf6li
Please also check the changelog for breaking changes:
  https://apache.org/dist/nutch/1.19/CHANGES.txt

Important changes are:
- Nutch builds on JDK 11
- protocol plugins can provide a custom URL stream handler to support
  custom URL schemes, e.g. smb://
and notable dependency upgrades include:
  Hadoop 3.3.4
  Solr 8.11.2
  Tika 2.3.0

Thanks to everyone who contributed to this release!




[RESULT] was [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-06 Thread Sebastian Nagel
Hi Folks,

thanks to everyone who was able to review the release candidate!

72 hours have definitely passed; please see below for the vote results.

[4] +1 Release this package as Apache Nutch 1.19
   Markus Jelsma *
   BlackIce *
   Jorge Betancourt *
   Sebastian Nagel *

[0] -1 Do not release this package because ...

* Nutch PMC

The VOTE passes with 4 binding votes from Nutch PMC members.

I'll continue to publish the release packages and announce the release.

Thanks to everyone who contributed to Nutch and the 1.19 release.

Sebastian


On 8/22/22 17:30, Sebastian Nagel wrote:
> Hi Folks,
> 
> A first candidate for the Nutch 1.19 release is available at:
> 
>https://dist.apache.org/repos/dist/dev/nutch/1.19/
> 
> The release candidate is a zip and tar.gz archive of the binary and sources 
> in:
>    https://github.com/apache/nutch/tree/release-1.19
> 
> In addition, a staged maven repository is available here:
>https://repository.apache.org/content/repositories/orgapachenutch-1020
> 
> We addressed 87 issues:
>https://s.apache.org/lf6li
> 
> 
> Please vote on releasing this package as Apache Nutch 1.19.
> The vote is open for the next 72 hours and passes if a majority
> of at least three +1 Nutch PMC votes are cast.
> 
> [ ] +1 Release this package as Apache Nutch 1.19.
> [ ] -1 Do not release this package because…
> 
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
> 
> P.S.
> Here is my +1.
> - tested most of Nutch tools and run a test crawl on a single-node cluster
>   running Hadoop 3.3.4, see
>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)


Re: Nutch 1.19 schema.xml

2022-09-04 Thread Sebastian Nagel
Hi Mike,

I think there shouldn't be any issues upgrading the new schema.xml into the Solr
core holding the index filled from Nutch, maybe with two exceptions:
- index-geoip is used (then some field definitions may change)
- an older Solr version is used (e.g. one not yet supporting
  solr.LatLonPointSpatialField)

If in doubt, I'd run a test first to be sure that the production system
isn't broken.
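
Updating in place could look roughly like this (a sketch; the core name
"nutch" and the paths are hypothetical):

  cp src/plugin/indexer-solr/schema.xml $SOLR_HOME/server/solr/nutch/conf/schema.xml
  # reload the core so the new schema takes effect
  curl "http://localhost:8983/solr/admin/cores?action=RELOAD&core=nutch"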

Best,
Sebastian

On 9/4/22 18:08, Mike wrote:
> Hello Sebastian!
> 
> Thanks for your answer!
> Is it possible to simply update the schema.xml file without re-indexing?
> 
> Thanks
> Mike
> 
> On Fri, Sep 2, 2022 at 13:25, Sebastian Nagel wrote:
> 
>> Hi Mike,
>>
>> the Nutch/Solr schema.xml will be updated with the release of 1.19
>> (expected
>> soon, a vote about RC#1 is ongoing):
>>  [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType
>>  [NUTCH-2957] - add fall-back field definitions for unknown index fields
>>  [NUTCH-2956] - typos in field names filled by index-geoip
>>
>> See the commits on the schema.xml
>>
>> https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml
>>
>> Best,
>> Sebastian
>>
>>
>> On 8/31/22 14:02, Mike wrote:
>>> Hello!
>>>
>>>
>>> Will the schema.xml stay the same in Nutch 1.19?
>>>
>>> thanks!
>>>
>>> mike
>>>
>>
> 


Re: Nutch 1.19 schema.xml

2022-09-04 Thread Mike
Hello Sebastian!

Thanks for your answer!
Is it possible to simply update the schema.xml file without re-indexing?

Thanks
Mike

On Fri, Sep 2, 2022 at 13:25, Sebastian Nagel wrote:

> Hi Mike,
>
> the Nutch/Solr schema.xml will be updated with the release of 1.19
> (expected
> soon, a vote about RC#1 is ongoing):
>  [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType
>  [NUTCH-2957] - add fall-back field definitions for unknown index fields
>  [NUTCH-2956] - typos in field names filled by index-geoip
>
> See the commits on the schema.xml
>
> https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml
>
> Best,
> Sebastian
>
>
> On 8/31/22 14:02, Mike wrote:
> > Hello!
> >
> >
> > Will the schema.xml stay the same in Nutch 1.19?
> >
> > thanks!
> >
> > mike
> >
>


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-09-02 Thread Sebastian Nagel
Hi Markus,

thanks!

Could you share the files in

  .ivy2/cache/org.apache.httpcomponents/httpasyncclient/

and maybe also the logs of a Nutch build starting with an empty ~/.ivy2/cache?
I'll have a look and compare it with what I find on my system - maybe use a new
thread on user@ or a Jira issue. I plan to close the vote over the weekend,
so let's keep this thread for the release vote alone.
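
Something along these lines should reproduce it (a sketch):

  mv ~/.ivy2/cache ~/.ivy2/cache.bak   # start from an empty Ivy cache
  ant clean runtime > build.log 2>&1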

Best,
Sebastian

On 8/29/22 14:17, Markus Jelsma wrote:
> Hello Sebastian,
> 
> No, the JAR isn't present. Multiple JARs are missing, probably because they
> are loaded after httpasyncclient. I checked the previously emptied Ivy
> cache. The Ivy files are there, but the JAR is missing there too.
> 
> markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
> ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties
> 
> I manually downloaded the JAR from [1] and added it to the jars/ directory
> in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache needs
> some more things than just adding the JAR manually.
> 
> The odd thing is that I got the URL below FROM the ivydata-4.1.4.properties
> file in the cache.
> 
> Since Ralf can compile it without problems, it seems to be an issue on my
> machine only. So Nutch seems fine, therefore +1.
> 
> Regards,
> Markus
> 
> [1]
> https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
> 
> 
> On Sun, Aug 28, 2022 at 12:05, Sebastian Nagel wrote:
> 
>> Hi Ralf,
>>
>>> It fetches it parses
>>
>> So a +1 ?
>>
>> Best,
>> Sebastian
>>
>> On 8/25/22 05:22, BlackIce wrote:
>>> nevermind I made a typo...
>>>
>>> It fetches it parses
>>>
>>> On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
>>>>
>>>> so far... it doesn't select anything when creating segments:
>>>> 0 records selected for fetching, exiting
>>>>
>>>> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
>>>>>
>>>>> I have been able to compile under OpenJDK 11
>>>>> Have not done anything further so far
>>>>> I'm gonna try to get to it this evening
>>>>>
>>>>> Greetz
>>>>> Ralf
>>>>>
>>>>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
>>>>>  wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Everything seems fine, the crawler seems fine when trying the binary
>>>>>> distribution. The source won't work because this computer still cannot
>>>>>> compile it. Clearing the local Ivy cache did not do much. This is the
>> known
>>>>>> compiler error with the elastic-indexer plugin:
>>>>>> compile:
>>>>>> [echo] Compiling plugin: indexer-elastic
>>>>>>[javac] Compiling 3 source files to
>>>>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
>>>>>>[javac]
>>>>>>
>> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
>>>>>> error: package org.apache.http.impl.nio.client does not exist
>>>>>>[javac] import
>> org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
>>>>>>[javac]   ^
>>>>>>[javac] 1 error
>>>>>>
>>>>>>
>>>>>> The binary distribution works fine though. I do see a lot of new
>> messages
>>>>>> when fetching:
>>>>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters
>> [LocalJobRunner
>>>>>> Map Task Executor #0] Found 0 extensions at
>>>>>> point:'org.apache.nutch.net.URLExemptionFilter'
>>>>>>
>>>>>> This is also new at start of each task:
>>>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>>>> SLF4J: Found binding in
>>>>>>
>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>>
>>>>>> SLF4J: Found binding in
>>>>>>
>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>>>
>>>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>>>> explanation.
>>>>>> SLF4

Re: Nutch 1.19 schema.xml

2022-09-02 Thread Sebastian Nagel
Hi Mike,

the Nutch/Solr schema.xml will be updated with the release of 1.19 (expected
soon, a vote about RC#1 is ongoing):
 [NUTCH-2955] - replace deprecated/removed field type solr.LatLonType
 [NUTCH-2957] - add fall-back field definitions for unknown index fields
 [NUTCH-2956] - typos in field names filled by index-geoip

See the commits on the schema.xml
  
https://github.com/apache/nutch/commits/master/src/plugin/indexer-solr/schema.xml

Best,
Sebastian


On 8/31/22 14:02, Mike wrote:
> Hello!
> 
> 
> Will the schema.xml stay the same in Nutch 1.19?
> 
> thanks!
> 
> mike
> 


Nutch 1.19 schema.xml

2022-08-31 Thread Mike
Hello!


Will the schema.xml stay the same in Nutch 1.19?

thanks!

mike


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-31 Thread Jorge Betancourt
Hi all,

Compiled from the sources (JDK 11) and ran a small crawl and indexing (to
Solr); both passed with flying colors.

That's a +1 from me. Great work Sebastian!

On Mon, Aug 22, 2022 at 5:30 PM Sebastian Nagel  wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.19 release is available at:
>
>https://dist.apache.org/repos/dist/dev/nutch/1.19/
>
> The release candidate is a zip and tar.gz archive of the binary and
> sources in:
>    https://github.com/apache/nutch/tree/release-1.19
>
> In addition, a staged maven repository is available here:
>https://repository.apache.org/content/repositories/orgapachenutch-1020
>
> We addressed 87 issues:
>https://s.apache.org/lf6li
>
>
> Please vote on releasing this package as Apache Nutch 1.19.
> The vote is open for the next 72 hours and passes if a majority
> of at least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.19.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
>
> P.S.
> Here is my +1.
> - tested most of Nutch tools and run a test crawl on a single-node cluster
>   running Hadoop 3.3.4, see
>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
>


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-30 Thread BlackIce
OK,
I compiled Nutch under JDK 11.
Did some basic fetching, parsing, link inversion, and subsequent indexing to Solr 9.
[+1]

Great work!
RRK

On Tue, Aug 30, 2022 at 12:22 PM BlackIce  wrote:
>
> Tried some indexing... but when manually doing "invertlinks" it says
> something about the input path not existing.
> Has invertlinks changed since 1.18?
>
> Greetz
> RRK
>
> On Mon, Aug 29, 2022 at 3:38 PM BlackIce  wrote:
> >
> > Haven't indexed anything to solr.. gonna give it a shot in a few hours
> >
> > On Mon, Aug 29, 2022 at 2:17 PM Markus Jelsma
> >  wrote:
> > >
> > > Hello Sebastian,
> > >
> > > No, the JAR isn't present. Multiple JARs are missing, probably because 
> > > they
> > > are loaded after httpasyncclient. I checked the previously emptied Ivy
> > > cache. The Ivy files are there, but the JAR is missing there too.
> > >
> > > markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
> > > ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties
> > >
> > > I manually downloaded the JAR from [1] and added it to the jars/ directory
> > > in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache 
> > > needs
> > > some more things than just adding the JAR manually.
> > >
> > > The odd thing is that I got the URL below FROM the
> > > ivydata-4.1.4.properties file in the cache.
> > >
> > > Since Ralf can compile it without problems, it seems to be an issue on my
> > > machine only. So Nutch seems fine, therefore +1.
> > >
> > > Regards,
> > > Markus
> > >
> > > [1]
> > > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
> > >
> > >
> > > On Sun, Aug 28, 2022 at 12:05, Sebastian Nagel wrote:
> > >
> > > > Hi Ralf,
> > > >
> > > > > It fetches it parses
> > > >
> > > > So a +1 ?
> > > >
> > > > Best,
> > > > Sebastian
> > > >
> > > > On 8/25/22 05:22, BlackIce wrote:
> > > > > nevermind I made a typo...
> > > > >
> > > > > It fetches it parses
> > > > >
> > > > > On Thu, Aug 25, 2022 at 3:42 AM BlackIce  
> > > > > wrote:
> > > > >>
> > > > >> so far... it doesn't select anything when creating segments:
> > > > >> 0 records selected for fetching, exiting
> > > > >>
> > > > >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  
> > > > >> wrote:
> > > > >>>
> > > > >>> I have been able to compile under OpenJDK 11
> > > > >>> Have not done anything further so far
> > > > >>> I'm gonna try to get to it this evening
> > > > >>>
> > > > >>> Greetz
> > > > >>> Ralf
> > > > >>>
> > > > >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
> > > > >>>  wrote:
> > > > >>>>
> > > > >>>> Hi,
> > > > >>>>
> > > > >>>> Everything seems fine, the crawler seems fine when trying the 
> > > > >>>> binary
> > > > >>>> distribution. The source won't work because this computer still 
> > > > >>>> cannot
> > > > >>>> compile it. Clearing the local Ivy cache did not do much. This is 
> > > > >>>> the
> > > > known
> > > > >>>> compiler error with the elastic-indexer plugin:
> > > > >>>> compile:
> > > > >>>> [echo] Compiling plugin: indexer-elastic
> > > > >>>>[javac] Compiling 3 source files to
> > > > >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
> > > > >>>>[javac]
> > > > >>>>
> > > > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> > > > >>>> error: package org.apache.http.impl.nio.client does not exist
> > > > >>>>[javac] import
> > > > org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
> > > > >>>>[javac] 

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-30 Thread BlackIce
Tried some indexing... but when manually running "invertlinks" it says
something about the input path not existing.
Has invertlinks changed since 1.18?

Greetz
RRK
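
(For reference, the invocation Nutch expects, with example paths: the linkdb
output directory first, then the segments.)

bin/nutch invertlinks crawl/linkdb -dir crawl/segments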

On Mon, Aug 29, 2022 at 3:38 PM BlackIce  wrote:
>
> Haven't indexed anything to solr.. gonna give it a shot in a few hours
>
> On Mon, Aug 29, 2022 at 2:17 PM Markus Jelsma
>  wrote:
> >
> > Hello Sebastian,
> >
> > No, the JAR isn't present. Multiple JARs are missing, probably because they
> > are loaded after httpasyncclient. I checked the previously emptied Ivy
> > cache. The Ivy files are there, but the JAR is missing there too.
> >
> > markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
> > ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties
> >
> > I manually downloaded the JAR from [1] and added it to the jars/ directory
> > in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache needs
> > some more things than just adding the JAR manually.
> >
> > The odd thing is, that i got the URL below FROM the ivydata-4.1.4.properties
> > file in the cache.
> >
> > Since Ralf can compile it without problems, it seems to be an issue on my
> > machine only. So Nutch seems fine, therefore +1.
> >
> > Regards,
> > Markus
> >
> > [1]
> > https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
> >
> >
> > Op zo 28 aug. 2022 om 12:05 schreef Sebastian Nagel
> > :
> >
> > > Hi Ralf,
> > >
> > > > It fetches it parses
> > >
> > > So a +1 ?
> > >
> > > Best,
> > > Sebastian
> > >
> > > On 8/25/22 05:22, BlackIce wrote:
> > > > nevermind I made a typo...
> > > >
> > > > It fetches it parses
> > > >
> > > > On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
> > > >>
> > > >> so far... it doesn't select anything when creating segments:
> > > >> 0 records selected for fetching, exiting
> > > >>
> > > >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
> > > >>>
> > > >>> I have been able to compile under OpenJDK 11
> > > >>> Have not done anything further so far
> > > >>> I'm gonna try to get to it this evening
> > > >>>
> > > >>> Greetz
> > > >>> Ralf
> > > >>>
> > > >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
> > > >>>  wrote:
> > > >>>>
> > > >>>> Hi,
> > > >>>>
> > > >>>> Everything seems fine, the crawler seems fine when trying the binary
> > > >>>> distribution. The source won't work because this computer still 
> > > >>>> cannot
> > > >>>> compile it. Clearing the local Ivy cache did not do much. This is the
> > > known
> > > >>>> compiler error with the elastic-indexer plugin:
> > > >>>> compile:
> > > >>>> [echo] Compiling plugin: indexer-elastic
> > > >>>>[javac] Compiling 3 source files to
> > > >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
> > > >>>>[javac]
> > > >>>>
> > > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> > > >>>> error: package org.apache.http.impl.nio.client does not exist
> > > >>>>[javac] import
> > > org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
> > > >>>>[javac]   ^
> > > >>>>[javac] 1 error
> > > >>>>
> > > >>>>
> > > >>>> The binary distribution works fine though. I do see a lot of new
> > > messages
> > > >>>> when fetching:
> > > >>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters
> > > [LocalJobRunner
> > > >>>> Map Task Executor #0] Found 0 extensions at
> > > >>>> point:'org.apache.nutch.net.URLExemptionFilter'
> > > >>>>
> > > >>>> This is also new at start of each task:
> > > >>>> SLF4J: Class path contains multiple SLF4J bindings.
> > > >>>> SLF4J: Found binding in
> > > >>>

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-29 Thread BlackIce
Haven't indexed anything to Solr... gonna give it a shot in a few hours

On Mon, Aug 29, 2022 at 2:17 PM Markus Jelsma
 wrote:
>
> Hello Sebastian,
>
> No, the JAR isn't present. Multiple JARs are missing, probably because they
> are loaded after httpasyncclient. I checked the previously emptied Ivy
> cache. The Ivy files are there, but the JAR is missing there too.
>
> markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
> ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties
>
> I manually downloaded the JAR from [1] and added it to the jars/ directory
> in the Ivy cache. It still cannot find the JAR, perhaps the Ivy cache needs
> some more things than just adding the JAR manually.
>
> The odd thing is, that i got the URL below FROM the ivydata-4.1.4.properties
> file in the cache.
>
> Since Ralf can compile it without problems, it seems to be an issue on my
> machine only. So Nutch seems fine, therefore +1.
>
> Regards,
> Markus
>
> [1]
> https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
>
>
> Op zo 28 aug. 2022 om 12:05 schreef Sebastian Nagel
> :
>
> > Hi Ralf,
> >
> > > It fetches it parses
> >
> > So a +1 ?
> >
> > Best,
> > Sebastian
> >
> > On 8/25/22 05:22, BlackIce wrote:
> > > nevermind I made a typo...
> > >
> > > It fetches it parses
> > >
> > > On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
> > >>
> > >> so far... it doesn't select anything when creating segments:
> > >> 0 records selected for fetching, exiting
> > >>
> > >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
> > >>>
> > >>> I have been able to compile under OpenJDK 11
> > >>> Have not done anything further so far
> > >>> I'm gonna try to get to it this evening
> > >>>
> > >>> Greetz
> > >>> Ralf
> > >>>
> > >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
> > >>>  wrote:
> > >>>>
> > >>>> Hi,
> > >>>>
> > >>>> Everything seems fine, the crawler seems fine when trying the binary
> > >>>> distribution. The source won't work because this computer still cannot
> > >>>> compile it. Clearing the local Ivy cache did not do much. This is the
> > known
> > >>>> compiler error with the elastic-indexer plugin:
> > >>>> compile:
> > >>>> [echo] Compiling plugin: indexer-elastic
> > >>>>[javac] Compiling 3 source files to
> > >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
> > >>>>[javac]
> > >>>>
> > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> > >>>> error: package org.apache.http.impl.nio.client does not exist
> > >>>>    [javac] import
> > org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
> > >>>>[javac]   ^
> > >>>>[javac] 1 error
> > >>>>
> > >>>>
> > >>>> The binary distribution works fine though. I do see a lot of new
> > messages
> > >>>> when fetching:
> > >>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters
> > [LocalJobRunner
> > >>>> Map Task Executor #0] Found 0 extensions at
> > >>>> point:'org.apache.nutch.net.URLExemptionFilter'
> > >>>>
> > >>>> This is also new at start of each task:
> > >>>> SLF4J: Class path contains multiple SLF4J bindings.
> > >>>> SLF4J: Found binding in
> > >>>>
> > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > >>>>
> > >>>> SLF4J: Found binding in
> > >>>>
> > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > >>>>
> > >>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > >>>> explanation.
> > >>>> SLF4J: Actual binding is of type
> > >>>> [org.apache.logging.slf4j.Log4jLoggerFactory]
> > >>>>
> > >>>> And this one at the end of fetcher:
> > >

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-29 Thread Markus Jelsma
Hello Sebastian,

No, the JAR isn't present. Multiple JARs are missing, probably because they
are loaded after httpasyncclient. I checked the previously emptied Ivy
cache. The Ivy files are there, but the JAR is missing there too.

markus@midas:~$ ls .ivy2/cache/org.apache.httpcomponents/httpasyncclient/
ivy-4.1.4.xml  ivy-4.1.4.xml.original  ivydata-4.1.4.properties

I manually downloaded the JAR from [1] and added it to the jars/ directory
in the Ivy cache. It still cannot find the JAR; perhaps the Ivy cache needs
more than just the JAR being added manually.
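
A fuller reset would be to wipe the whole cached module so Ivy re-resolves
it, roughly (assuming the default ~/.ivy2 layout):

rm -rf ~/.ivy2/cache/org.apache.httpcomponents/httpasyncclient
ant clean runtime   # from the Nutch source root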

The odd thing is that I got the URL below FROM the ivydata-4.1.4.properties
file in the cache.

Since Ralf can compile it without problems, it seems to be an issue on my
machine only. So Nutch seems fine, therefore +1.

Regards,
Markus

[1]
https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/


On Sun, Aug 28, 2022 at 12:05 Sebastian Nagel
wrote:

> Hi Ralf,
>
> > It fetches it parses
>
> So a +1 ?
>
> Best,
> Sebastian
>
> On 8/25/22 05:22, BlackIce wrote:
> > nevermind I made a typo...
> >
> > It fetches it parses
> >
> > On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
> >>
> >> so far... it doesn't select anything when creating segments:
> >> 0 records selected for fetching, exiting
> >>
> >> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
> >>>
> >>> I have been able to compile under OpenJDK 11
> >>> Have not done anything further so far
> >>> I'm gonna try to get to it this evening
> >>>
> >>> Greetz
> >>> Ralf
> >>>
> >>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
> >>>  wrote:
> >>>>
> >>>> Hi,
> >>>>
> >>>> Everything seems fine, the crawler seems fine when trying the binary
> >>>> distribution. The source won't work because this computer still cannot
> >>>> compile it. Clearing the local Ivy cache did not do much. This is the
> known
> >>>> compiler error with the elastic-indexer plugin:
> >>>> compile:
> >>>> [echo] Compiling plugin: indexer-elastic
> >>>>[javac] Compiling 3 source files to
> >>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
> >>>>[javac]
> >>>>
> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> >>>> error: package org.apache.http.impl.nio.client does not exist
> >>>>[javac] import
> org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
> >>>>[javac]   ^
> >>>>[javac] 1 error
> >>>>
> >>>>
> >>>> The binary distribution works fine though. I do see a lot of new
> messages
> >>>> when fetching:
> >>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters
> [LocalJobRunner
> >>>> Map Task Executor #0] Found 0 extensions at
> >>>> point:'org.apache.nutch.net.URLExemptionFilter'
> >>>>
> >>>> This is also new at start of each task:
> >>>> SLF4J: Class path contains multiple SLF4J bindings.
> >>>> SLF4J: Found binding in
> >>>>
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >>>>
> >>>> SLF4J: Found binding in
> >>>>
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >>>>
> >>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> >>>> explanation.
> >>>> SLF4J: Actual binding is of type
> >>>> [org.apache.logging.slf4j.Log4jLoggerFactory]
> >>>>
> >>>> And this one at the end of fetcher:
> >>>> log4j:WARN No appenders could be found for logger
> >>>> (org.apache.commons.httpclient.params.DefaultHttpParams).
> >>>> log4j:WARN Please initialize the log4j system properly.
> >>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig
> for
> >>>> more info.
> >>>>
> >>>> I am worried about the indexer-elastic plugin, maybe others have that
> >>>> problem too? Otherwise everything seems fine.
> >>>>
> >>>> Markus
> >>>>
> >>>> Op ma 22 aug. 2022 

Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
Hi Ralf,

> It fetches it parses

So a +1?

Best,
Sebastian

On 8/25/22 05:22, BlackIce wrote:
> nevermind I made a typo...
> 
> It fetches it parses
> 
> On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
>>
>> so far... it doesn't select anything when creating segments:
>> 0 records selected for fetching, exiting
>>
>> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
>>>
>>> I have been able to compile under OpenJDK 11
>>> Have not done anything further so far
>>> I'm gonna try to get to it this evening
>>>
>>> Greetz
>>> Ralf
>>>
>>> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
>>>  wrote:
>>>>
>>>> Hi,
>>>>
>>>> Everything seems fine, the crawler seems fine when trying the binary
>>>> distribution. The source won't work because this computer still cannot
>>>> compile it. Clearing the local Ivy cache did not do much. This is the known
>>>> compiler error with the elastic-indexer plugin:
>>>> compile:
>>>> [echo] Compiling plugin: indexer-elastic
>>>>[javac] Compiling 3 source files to
>>>> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
>>>>[javac]
>>>> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
>>>> error: package org.apache.http.impl.nio.client does not exist
>>>>[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
>>>>[javac]   ^
>>>>[javac] 1 error
>>>>
>>>>
>>>> The binary distribution works fine though. I do see a lot of new messages
>>>> when fetching:
>>>> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
>>>> Map Task Executor #0] Found 0 extensions at
>>>> point:'org.apache.nutch.net.URLExemptionFilter'
>>>>
>>>> This is also new at start of each task:
>>>> SLF4J: Class path contains multiple SLF4J bindings.
>>>> SLF4J: Found binding in
>>>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>
>>>> SLF4J: Found binding in
>>>> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>>>>
>>>> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
>>>> explanation.
>>>> SLF4J: Actual binding is of type
>>>> [org.apache.logging.slf4j.Log4jLoggerFactory]
>>>>
>>>> And this one at the end of fetcher:
>>>> log4j:WARN No appenders could be found for logger
>>>> (org.apache.commons.httpclient.params.DefaultHttpParams).
>>>> log4j:WARN Please initialize the log4j system properly.
>>>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
>>>> more info.
>>>>
>>>> I am worried about the indexer-elastic plugin, maybe others have that
>>>> problem too? Otherwise everything seems fine.
>>>>
>>>> Markus
>>>>
>>>> Op ma 22 aug. 2022 om 17:30 schreef Sebastian Nagel :
>>>>
>>>>> Hi Folks,
>>>>>
>>>>> A first candidate for the Nutch 1.19 release is available at:
>>>>>
>>>>>https://dist.apache.org/repos/dist/dev/nutch/1.19/
>>>>>
>>>>> The release candidate is a zip and tar.gz archive of the binary and
>>>>> sources in:
>>>>>https://github.com/apache/nutch/tree/release-1.19
>>>>>
>>>>> In addition, a staged maven repository is available here:
>>>>>https://repository.apache.org/content/repositories/orgapachenutch-1020
>>>>>
>>>>> We addressed 87 issues:
>>>>>https://s.apache.org/lf6li
>>>>>
>>>>>
>>>>> Please vote on releasing this package as Apache Nutch 1.19.
>>>>> The vote is open for the next 72 hours and passes if a majority
>>>>> of at least three +1 Nutch PMC votes are cast.
>>>>>
>>>>> [ ] +1 Release this package as Apache Nutch 1.19.
>>>>> [ ] -1 Do not release this package because…
>>>>>
>>>>> Cheers,
>>>>> Sebastian
>>>>> (On behalf of the Nutch PMC)
>>>>>
>>>>> P.S.
>>>>> Here is my +1.
>>>>> - tested most of Nutch tools and run a test crawl on a single-node cluster
>>>>>   running Hadoop 3.3.4, see
>>>>>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
>>>>>


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-28 Thread Sebastian Nagel
Hi Markus,

Thanks! What's your (final) decision?


>[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;

During the build, the class should be provided in
  build/plugins/indexer-elastic/httpasyncclient-4.1.4.jar
Could you verify whether this jar is there and whether it contains the class
file? See also:
  
https://repo1.maven.org/maven2/org/apache/httpcomponents/httpasyncclient/4.1.4/
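
Something like this should show it (paths relative to the Nutch source root,
after a build):

ls -l build/plugins/indexer-elastic/httpasyncclient-4.1.4.jar
unzip -l build/plugins/indexer-elastic/httpasyncclient-4.1.4.jar | grep HttpAsyncClientBuilder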

> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.

In order to fix it, we need to make the error reproducible, or at least figure
out what the reason is.


Regarding the logging: we switched to log4j 2.x (NUTCH-2915) while Hadoop now
uses reload4j (HADOOP-18088 [1]). The logging configuration should be improved
to avoid the warnings in local mode. In distributed mode, the logging
configuration of the provided Hadoop takes over.
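
As an untested stopgap in local mode, removing the second binding from the
binary distribution should at least silence the multiple-bindings warning:

# keep log4j-slf4j-impl as the only SLF4J binding on the classpath
rm lib/slf4j-reload4j-1.7.36.jar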


Best,
Sebastian

[1] https://issues.apache.org/jira/browse/HADOOP-18088


On 8/24/22 13:28, Markus Jelsma wrote:
> Hi,
> 
> Everything seems fine, the crawler seems fine when trying the binary
> distribution. The source won't work because this computer still cannot
> compile it. Clearing the local Ivy cache did not do much. This is the known
> compiler error with the elastic-indexer plugin:
> compile:
> [echo] Compiling plugin: indexer-elastic
>[javac] Compiling 3 source files to
> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
>    [javac]
> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> error: package org.apache.http.impl.nio.client does not exist
>[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
>[javac]   ^
>[javac] 1 error
> 
> 
> The binary distribution works fine though. I do see a lot of new messages
> when fetching:
> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
> Map Task Executor #0] Found 0 extensions at
> point:'org.apache.nutch.net.URLExemptionFilter'
> 
> This is also new at start of each task:
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> 
> SLF4J: Found binding in
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> 
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type
> [org.apache.logging.slf4j.Log4jLoggerFactory]
> 
> And this one at the end of fetcher:
> log4j:WARN No appenders could be found for logger
> (org.apache.commons.httpclient.params.DefaultHttpParams).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
> 
> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.
> 
> Markus
> 
> Op ma 22 aug. 2022 om 17:30 schreef Sebastian Nagel :
> 
>> Hi Folks,
>>
>> A first candidate for the Nutch 1.19 release is available at:
>>
>>https://dist.apache.org/repos/dist/dev/nutch/1.19/
>>
>> The release candidate is a zip and tar.gz archive of the binary and
>> sources in:
>>https://github.com/apache/nutch/tree/release-1.19
>>
>> In addition, a staged maven repository is available here:
>>https://repository.apache.org/content/repositories/orgapachenutch-1020
>>
>> We addressed 87 issues:
>>https://s.apache.org/lf6li
>>
>>
>> Please vote on releasing this package as Apache Nutch 1.19.
>> The vote is open for the next 72 hours and passes if a majority
>> of at least three +1 Nutch PMC votes are cast.
>>
>> [ ] +1 Release this package as Apache Nutch 1.19.
>> [ ] -1 Do not release this package because…
>>
>> Cheers,
>> Sebastian
>> (On behalf of the Nutch PMC)
>>
>> P.S.
>> Here is my +1.
>> - tested most of Nutch tools and run a test crawl on a single-node cluster
>>   running Hadoop 3.3.4, see
>>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
>>
> 


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread BlackIce
Never mind, I made a typo...

It fetches, it parses

On Thu, Aug 25, 2022 at 3:42 AM BlackIce  wrote:
>
> so far... it doesn't select anything when creating segments:
> 0 records selected for fetching, exiting
>
> On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
> >
> > I have been able to compile under OpenJDK 11
> > Have not done anything further so far
> > I'm gonna try to get to it this evening
> >
> > Greetz
> > Ralf
> >
> > On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
> >  wrote:
> > >
> > > Hi,
> > >
> > > Everything seems fine, the crawler seems fine when trying the binary
> > > distribution. The source won't work because this computer still cannot
> > > compile it. Clearing the local Ivy cache did not do much. This is the 
> > > known
> > > compiler error with the elastic-indexer plugin:
> > > compile:
> > > [echo] Compiling plugin: indexer-elastic
> > >[javac] Compiling 3 source files to
> > > /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
> > >[javac]
> > > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> > > error: package org.apache.http.impl.nio.client does not exist
> > >[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
> > >[javac]   ^
> > >[javac] 1 error
> > >
> > >
> > > The binary distribution works fine though. I do see a lot of new messages
> > > when fetching:
> > > 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
> > > Map Task Executor #0] Found 0 extensions at
> > > point:'org.apache.nutch.net.URLExemptionFilter'
> > >
> > > This is also new at start of each task:
> > > SLF4J: Class path contains multiple SLF4J bindings.
> > > SLF4J: Found binding in
> > > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > >
> > > SLF4J: Found binding in
> > > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> > >
> > > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > > explanation.
> > > SLF4J: Actual binding is of type
> > > [org.apache.logging.slf4j.Log4jLoggerFactory]
> > >
> > > And this one at the end of fetcher:
> > > log4j:WARN No appenders could be found for logger
> > > (org.apache.commons.httpclient.params.DefaultHttpParams).
> > > log4j:WARN Please initialize the log4j system properly.
> > > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> > > more info.
> > >
> > > I am worried about the indexer-elastic plugin, maybe others have that
> > > problem too? Otherwise everything seems fine.
> > >
> > > Markus
> > >
> > > Op ma 22 aug. 2022 om 17:30 schreef Sebastian Nagel :
> > >
> > > > Hi Folks,
> > > >
> > > > A first candidate for the Nutch 1.19 release is available at:
> > > >
> > > >https://dist.apache.org/repos/dist/dev/nutch/1.19/
> > > >
> > > > The release candidate is a zip and tar.gz archive of the binary and
> > > > sources in:
> > > >https://github.com/apache/nutch/tree/release-1.19
> > > >
> > > > In addition, a staged maven repository is available here:
> > > >
> > > > https://repository.apache.org/content/repositories/orgapachenutch-1020
> > > >
> > > > We addressed 87 issues:
> > > >https://s.apache.org/lf6li
> > > >
> > > >
> > > > Please vote on releasing this package as Apache Nutch 1.19.
> > > > The vote is open for the next 72 hours and passes if a majority
> > > > of at least three +1 Nutch PMC votes are cast.
> > > >
> > > > [ ] +1 Release this package as Apache Nutch 1.19.
> > > > [ ] -1 Do not release this package because…
> > > >
> > > > Cheers,
> > > > Sebastian
> > > > (On behalf of the Nutch PMC)
> > > >
> > > > P.S.
> > > > Here is my +1.
> > > > - tested most of Nutch tools and run a test crawl on a single-node 
> > > > cluster
> > > >   running Hadoop 3.3.4, see
> > > >   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
> > > >


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread BlackIce
so far... it doesn't select anything when creating segments:
0 records selected for fetching, exiting
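
(For reference, a quick way to see what the CrawlDb contains; the path is an
example.)

bin/nutch readdb crawl/crawldb -stats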

On Wed, Aug 24, 2022 at 3:02 PM BlackIce  wrote:
>
> I have been able to compile under OpenJDK 11
> Have not done anything further so far
> I'm gonna try to get to it this evening
>
> Greetz
> Ralf
>
> On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
>  wrote:
> >
> > Hi,
> >
> > Everything seems fine, the crawler seems fine when trying the binary
> > distribution. The source won't work because this computer still cannot
> > compile it. Clearing the local Ivy cache did not do much. This is the known
> > compiler error with the elastic-indexer plugin:
> > compile:
> > [echo] Compiling plugin: indexer-elastic
> >[javac] Compiling 3 source files to
> > /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
> >    [javac]
> > /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> > error: package org.apache.http.impl.nio.client does not exist
> >[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
> >[javac]   ^
> >[javac] 1 error
> >
> >
> > The binary distribution works fine though. I do see a lot of new messages
> > when fetching:
> > 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
> > Map Task Executor #0] Found 0 extensions at
> > point:'org.apache.nutch.net.URLExemptionFilter'
> >
> > This is also new at start of each task:
> > SLF4J: Class path contains multiple SLF4J bindings.
> > SLF4J: Found binding in
> > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >
> > SLF4J: Found binding in
> > [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> >
> > SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> > explanation.
> > SLF4J: Actual binding is of type
> > [org.apache.logging.slf4j.Log4jLoggerFactory]
> >
> > And this one at the end of fetcher:
> > log4j:WARN No appenders could be found for logger
> > (org.apache.commons.httpclient.params.DefaultHttpParams).
> > log4j:WARN Please initialize the log4j system properly.
> > log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> > more info.
> >
> > I am worried about the indexer-elastic plugin, maybe others have that
> > problem too? Otherwise everything seems fine.
> >
> > Markus
> >
> > Op ma 22 aug. 2022 om 17:30 schreef Sebastian Nagel :
> >
> > > Hi Folks,
> > >
> > > A first candidate for the Nutch 1.19 release is available at:
> > >
> > >https://dist.apache.org/repos/dist/dev/nutch/1.19/
> > >
> > > The release candidate is a zip and tar.gz archive of the binary and
> > > sources in:
> > >https://github.com/apache/nutch/tree/release-1.19
> > >
> > > In addition, a staged maven repository is available here:
> > >https://repository.apache.org/content/repositories/orgapachenutch-1020
> > >
> > > We addressed 87 issues:
> > >https://s.apache.org/lf6li
> > >
> > >
> > > Please vote on releasing this package as Apache Nutch 1.19.
> > > The vote is open for the next 72 hours and passes if a majority
> > > of at least three +1 Nutch PMC votes are cast.
> > >
> > > [ ] +1 Release this package as Apache Nutch 1.19.
> > > [ ] -1 Do not release this package because…
> > >
> > > Cheers,
> > > Sebastian
> > > (On behalf of the Nutch PMC)
> > >
> > > P.S.
> > > Here is my +1.
> > > - tested most of Nutch tools and run a test crawl on a single-node cluster
> > >   running Hadoop 3.3.4, see
> > >   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
> > >


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread BlackIce
I have been able to compile under OpenJDK 11
Have not done anything further so far
I'm gonna try to get to it this evening

Greetz
Ralf

On Wed, Aug 24, 2022 at 1:29 PM Markus Jelsma
 wrote:
>
> Hi,
>
> Everything seems fine, the crawler seems fine when trying the binary
> distribution. The source won't work because this computer still cannot
> compile it. Clearing the local Ivy cache did not do much. This is the known
> compiler error with the elastic-indexer plugin:
> compile:
> [echo] Compiling plugin: indexer-elastic
>[javac] Compiling 3 source files to
> /home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
>    [javac]
> /home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
> error: package org.apache.http.impl.nio.client does not exist
>[javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
>[javac]   ^
>[javac] 1 error
>
>
> The binary distribution works fine though. I do see a lot of new messages
> when fetching:
> 2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
> Map Task Executor #0] Found 0 extensions at
> point:'org.apache.nutch.net.URLExemptionFilter'
>
> This is also new at start of each task:
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: Found binding in
> [jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
>
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
> explanation.
> SLF4J: Actual binding is of type
> [org.apache.logging.slf4j.Log4jLoggerFactory]
>
> And this one at the end of fetcher:
> log4j:WARN No appenders could be found for logger
> (org.apache.commons.httpclient.params.DefaultHttpParams).
> log4j:WARN Please initialize the log4j system properly.
> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
> more info.
>
> I am worried about the indexer-elastic plugin, maybe others have that
> problem too? Otherwise everything seems fine.
>
> Markus
>
> Op ma 22 aug. 2022 om 17:30 schreef Sebastian Nagel :
>
> > Hi Folks,
> >
> > A first candidate for the Nutch 1.19 release is available at:
> >
> >https://dist.apache.org/repos/dist/dev/nutch/1.19/
> >
> > The release candidate is a zip and tar.gz archive of the binary and
> > sources in:
> >https://github.com/apache/nutch/tree/release-1.19
> >
> > In addition, a staged maven repository is available here:
> >https://repository.apache.org/content/repositories/orgapachenutch-1020
> >
> > We addressed 87 issues:
> >https://s.apache.org/lf6li
> >
> >
> > Please vote on releasing this package as Apache Nutch 1.19.
> > The vote is open for the next 72 hours and passes if a majority
> > of at least three +1 Nutch PMC votes are cast.
> >
> > [ ] +1 Release this package as Apache Nutch 1.19.
> > [ ] -1 Do not release this package because…
> >
> > Cheers,
> > Sebastian
> > (On behalf of the Nutch PMC)
> >
> > P.S.
> > Here is my +1.
> > - tested most of Nutch tools and run a test crawl on a single-node cluster
> >   running Hadoop 3.3.4, see
> >   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
> >


Re: [VOTE] Release Apache Nutch 1.19 RC#1

2022-08-24 Thread Markus Jelsma
Hi,

Everything seems fine; the crawler works when trying the binary
distribution. The source won't work because this computer still cannot
compile it. Clearing the local Ivy cache did not do much. This is the known
compiler error with the indexer-elastic plugin:
compile:
[echo] Compiling plugin: indexer-elastic
   [javac] Compiling 3 source files to
/home/markus/temp/apache-nutch-1.19/build/indexer-elastic/classes
   [javac]
/home/markus/temp/apache-nutch-1.19/src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java:39:
error: package org.apache.http.impl.nio.client does not exist
   [javac] import org.apache.http.impl.nio.client.HttpAsyncClientBuilder;
   [javac]   ^
   [javac] 1 error


The binary distribution works fine though. I do see a lot of new messages
when fetching:
2022-08-24 13:21:15,867 INFO o.a.n.n.URLExemptionFilters [LocalJobRunner
Map Task Executor #0] Found 0 extensions at
point:'org.apache.nutch.net.URLExemptionFilter'

This is also new at start of each task:
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in
[jar:file:/home/markus/temp/apache-nutch-1.19/lib/log4j-slf4j-impl-2.18.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/markus/temp/apache-nutch-1.19/lib/slf4j-reload4j-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.
SLF4J: Actual binding is of type
[org.apache.logging.slf4j.Log4jLoggerFactory]

And this one at the end of fetcher:
log4j:WARN No appenders could be found for logger
(org.apache.commons.httpclient.params.DefaultHttpParams).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for
more info.

I am worried about the indexer-elastic plugin, maybe others have that
problem too? Otherwise everything seems fine.

Markus

On Mon, Aug 22, 2022 at 17:30 Sebastian Nagel wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.19 release is available at:
>
>https://dist.apache.org/repos/dist/dev/nutch/1.19/
>
> The release candidate is a zip and tar.gz archive of the binary and
> sources in:
>    https://github.com/apache/nutch/tree/release-1.19
>
> In addition, a staged maven repository is available here:
>https://repository.apache.org/content/repositories/orgapachenutch-1020
>
> We addressed 87 issues:
>https://s.apache.org/lf6li
>
>
> Please vote on releasing this package as Apache Nutch 1.19.
> The vote is open for the next 72 hours and passes if a majority
> of at least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.19.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Sebastian
> (On behalf of the Nutch PMC)
>
> P.S.
> Here is my +1.
> - tested most of Nutch tools and run a test crawl on a single-node cluster
>   running Hadoop 3.3.4, see
>   https://github.com/sebastian-nagel/nutch-test-single-node-cluster/)
>


[VOTE] Release Apache Nutch 1.19 RC#1

2022-08-22 Thread Sebastian Nagel
Hi Folks,

A first candidate for the Nutch 1.19 release is available at:

   https://dist.apache.org/repos/dist/dev/nutch/1.19/

The release candidate is a zip and tar.gz archive of the binary and sources in:
   https://github.com/apache/nutch/tree/release-1.19

In addition, a staged maven repository is available here:
   https://repository.apache.org/content/repositories/orgapachenutch-1020

We addressed 87 issues:
   https://s.apache.org/lf6li


Please vote on releasing this package as Apache Nutch 1.19.
The vote is open for the next 72 hours and passes if a majority
of at least three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.19.
[ ] -1 Do not release this package because…

Cheers,
Sebastian
(On behalf of the Nutch PMC)

P.S.
Here is my +1.
- tested most of the Nutch tools and ran a test crawl on a single-node cluster
  running Hadoop 3.3.4, see
  https://github.com/sebastian-nagel/nutch-test-single-node-cluster/
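
For anyone reproducing this on a cluster, the deploy runtime submits the job
jar to Hadoop (a sketch, with example paths):

ant clean runtime
# runtime/deploy/bin/nutch wraps "hadoop jar" around the generated job file
runtime/deploy/bin/nutch inject crawl/crawldb urls/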


Re: Question about Nutch plugins

2022-07-24 Thread Sebastian Nagel
Hi Rastko,

the description isn't really correct now, as NUTCH_HOME is supposed to point
to the runtime:

- if the binary package is used: this is the base folder of the package,
  eg. apache-nutch-1.18/

- if Nutch is built from the source, you usually point NUTCH_HOME to
  runtime/local/ - the directory tree below this folder looks pretty much
  the same as the binary package

Older versions of Nutch didn't have this separation of source and runtime.


> I use nutch by just unzipping apache-nutch-1.17-bin.tar.gz)?

If you want to build your own plugin, I'd recommend starting from the Nutch
source package, or even the current master by cloning the Nutch git
repository, as sketched below.
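
A minimal sketch of that workflow (the plugin name and layout are examples):

git clone https://github.com/apache/nutch.git
cd nutch
# a new plugin lives under src/plugin/, next to the bundled ones,
# where build-plugin.xml is visible to its build.xml
mkdir src/plugin/myplugin
# after adding the plugin sources and registering it in src/plugin/build.xml:
ant runtime
export NUTCH_HOME="$PWD/runtime/local"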


As always for a community project: feel free to improve the tutorial; it may
well be out of date.


Best,
Sebastian


On 7/23/22 13:28, Rastko.pavlovic wrote:
> Hi all,
> 
> I've been trying to implement this tutorial 
> https://cwiki.apache.org/confluence/display/nutch/WritingPluginExample on 
> Nutch 1.17. In several places, the tutorial refers to $NUTCH_HOME/src/plugin. 
> However, in my $NUTCH_HOME I only have a "plugin" directory and no src. If I 
> try building the plugin in a sub directory of the "plugin" directory with 
> ant, I get a problem where build.xml complains that it can't find 
> build-plugin.xml.
> 
> Does anyone maybe know what am I doing wrong (in case it helps, I use nutch 
> by just unzipping apache-nutch-1.17-bin.tar.gz)?
> 
> Many thanks in advance.
> 
> Best regards,
> Rastko
> 


Question about Nutch plugins

2022-07-23 Thread Rastko.pavlovic
Hi all,

I've been trying to implement this tutorial 
https://cwiki.apache.org/confluence/display/nutch/WritingPluginExample on Nutch 
1.17. In several places, the tutorial refers to $NUTCH_HOME/src/plugin. 
However, in my $NUTCH_HOME I only have a "plugin" directory and no src. If I 
try building the plugin in a subdirectory of the "plugin" directory with ant, 
I get a problem where build.xml complains that it can't find build-plugin.xml.

Does anyone know what I am doing wrong (in case it helps, I use Nutch by 
just unzipping apache-nutch-1.17-bin.tar.gz)?

Many thanks in advance.

Best regards,
Rastko


Re: Problem with Nutch <-> Eclipse

2022-07-21 Thread Robert Scavilla
Hello Sebastian, every time you help me I am reminded how grateful I am that
there are people like you!

I am following these instructions:
https://cwiki.apache.org/confluence/display/nutch/RunNutchInEclipse

The most prevalent error is: "The package org.w3c.dom is accessible from
more than one module: <unnamed>, java.xml"
This message occurs for a few other packages as well, but I figure if you
can help me fix one, I will be more able to fix the others.

There is also an error in LuceneAnalyzerUtil.java: "STOP_WORDS_SET cannot
be resolved or is not a field".

Enjoy your day and thank you,
...bob

On Tue, Jul 19, 2022 at 2:48 AM Sebastian Nagel 
wrote:

> Hi Bob,
>
> could you share which instructions and when the error happens - during
> import,
> project build, running/debugging?
>
> The usual way is
>
> 1. to write the Eclipse project configuration, run
>
>ant eclipse
>
> 2. import the written project configuration into Eclipse
>
>
> Building or running/debugging Nutch in Eclipse is possible although
> requires
> some work to get everything right.
>
>
> Best,
> Sebastian
>
>
> On 7/15/22 23:00, Robert Scavilla wrote:
> > Hello Kind People, I am trying to set up Nutch with eclipse. I am
> following
> > the instructions and have an issue that I have not been able to resolve
> > yet. I have the error: *"package org.w3c.dom is accessible from more than
> > one module"*
> >
> > There are several modules that get this same error message. The project
> > compiles from the command line without error. It is not clear to me how
> to
> > resolve this and I hope you can help.
> >
> > Thank you!
> > ...bob
> >
>


Re: Problem with Nutch <-> Eclipse

2022-07-19 Thread Sebastian Nagel
Hi Bob,

Could you share which instructions you followed, and when the error happens -
during import, project build, or running/debugging?

The usual way is

1. to write the Eclipse project configuration, run

   ant eclipse

2. import the written project configuration into Eclipse


Building or running/debugging Nutch in Eclipse is possible, although it
requires some work to get everything right.


Best,
Sebastian


On 7/15/22 23:00, Robert Scavilla wrote:
> Hello Kind People, I am trying to set up Nutch with eclipse. I am following
> the instructions and have an issue that I have not been able to resolve
> yet. I have the error: *"package org.w3c.dom is accessible from more than
> one module"*
> 
> There are several modules that get this same error message. The project
> compiles from the command line without error. It is not clear to me how to
> resolve this and I hope you can help.
> 
> Thank you!
> ...bob
> 


Problem with Nutch <-> Eclipse

2022-07-15 Thread Robert Scavilla
Hello Kind People, I am trying to set up Nutch with Eclipse. I am following
the instructions and have an issue that I have not been able to resolve
yet. I have the error: "package org.w3c.dom is accessible from more than
one module"

There are several modules that get this same error message. The project
compiles from the command line without error. It is not clear to me how to
resolve this and I hope you can help.

Thank you!
...bob


Re: Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-13 Thread Markus Jelsma
To add to Sebastian, it also runs very well on Hadoop 3.3.x. Actually, I have
never had a Hadoop version that could not run Nutch out of the box and
without issues.

On Mon, Jun 13, 2022 at 11:54 Sebastian Nagel
wrote:

> Hi Michael,
>
> Nutch (1.18, and trunk/master) should work together with more recent Hadoop
> versions.
>
> At Common Crawl we use a modified Nutch version based on the recent trunk
> running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop
> cluster
> with x64 and arm64 AWS EC2 instances.
>
> But I'm sure there are more possible combinations.
>
> One important note: in trunk/master there is a yet unsolved regression
> caused by
> the newly introduced plugin-based URL stream handlers, see NUTCH-2936 and
> NUTCH-2949. Unless these are resolved, you need to undo these commits in
> order
> to run Nutch (built from trunk/master) in distributed mode.
>
> Best,
> Sebastian
>
> On 6/13/22 01:37, Michael Coffey wrote:
> > Do current 1.x versions of Nutch (1.18, and trunk/master) work with
> versions of Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from
> October 2019, and there are many newer versions available. For example,
> 3.1.4 came out in 2020, and there are 3.2.x and 3.3.x versions that came
> out this year.
> >
> > I don’t care about newer features in Hadoop, I just have general
> concerns about stability and security. I am working on reviving an old
> project and would like to put together the best possible infrastructure for
> the future.
> >
> >
>


Re: Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-13 Thread Sebastian Nagel
Hi Michael,

Nutch (1.18, and trunk/master) should work together with more recent Hadoop
versions.

At Common Crawl we use a modified Nutch version based on the recent trunk
running on Hadoop 3.2.2 (soon 3.2.3) and Java 11, even on a mixed Hadoop cluster
with x64 and arm64 AWS EC2 instances.

But I'm sure there are more possible combinations.

One important note: in trunk/master there is an as yet unsolved regression caused by
the newly introduced plugin-based URL stream handlers, see NUTCH-2936 and
NUTCH-2949. Until these are resolved, you need to undo these commits in order
to run Nutch (built from trunk/master) in distributed mode.
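
Roughly (the commit IDs are placeholders, to be looked up first):

git log --oneline | grep -iE "NUTCH-2936|NUTCH-2949"
git revert <commit-id-1> <commit-id-2>
ant clean runtime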

Best,
Sebastian

On 6/13/22 01:37, Michael Coffey wrote:
> Do current 1.x versions of Nutch (1.18, and trunk/master) work with versions 
> of Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from October 
> 2019, and there are many newer versions available. For example, 3.1.4 came 
> out in 2020, and there are 3.2.x and 3.3.x versions that came out this year.
> 
> I don’t care about newer features in Hadoop, I just have general concerns 
> about stability and security. I am working on reviving an old project and 
> would like to put together the best possible infrastructure for the future.
> 
> 


Does Nutch work with Hadoop Versions greater than 3.1.3?

2022-06-12 Thread Michael Coffey
Do current 1.x versions of Nutch (1.18, and trunk/master) work with versions of 
Hadoop greater than 3.1.3? I ask because Hadoop 3.1.3 is from October 2019, and 
there are many newer versions available. For example, 3.1.4 came out in 2020, 
and there are 3.2.x and 3.3.x versions that came out this year.

I don’t care about newer features in Hadoop; I just have general concerns about 
stability and security. I am working on reviving an old project and would like 
to put together the best possible infrastructure for the future.



RE: Nutch not crawling all URLs

2022-02-16 Thread Roseline Antai
Hi,

Just continuing this thread, I tried the Selenium plugin as suggested below. I
have copied my nutch-site.xml below to show the parameters set for the
Selenium plugin. I have taken most of the descriptions out for brevity:

<?xml version="1.0"?>
<configuration>

<property>
  <name>http.agent.name</name>
  <value>Esid Crawler</value>
</property>
<property>
  <name>http.agent.email</name>
  <value>roselineantai at gmail dot com</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://esid.shinyapps.io/ESID/</value>
</property>
<property>
  <name>db.ignore.also.redirects</name>
  <value>false</value>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>30</value>
  <description>The default number of seconds between re-fetches of a page (30
  days).</description>
</property>
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
<property>
  <name>db.ignore.external.links</name>
  <value>true</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for truncated
  documents. By default this property is activated due to extremely high
  levels of CPU which parsing can sometimes take.</description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>10</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-tika|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <name>selenium.driver</name>
  <value>chrome</value>
</property>
<property>
  <name>selenium.take.screenshot</name>
  <value>false</value>
</property>
<property>
  <name>selenium.screenshot.location</name>
  <value></value>
</property>
<property>
  <name>selenium.hub.port</name>
  <value></value>
  <description>Selenium Hub Location connection port</description>
</property>
<property>
  <name>selenium.hub.path</name>
  <value>/wd/hub</value>
  <description>Selenium Hub Location connection path</description>
</property>
<property>
  <name>selenium.hub.host</name>
  <value>localhost</value>
  <description>Selenium Hub Location connection host</description>
</property>
<property>
  <name>selenium.hub.protocol</name>
  <value>http</value>
  <description>Selenium Hub Location connection protocol</description>
</property>
<property>
  <name>selenium.grid.driver</name>
  <value>chrome</value>
</property>
<property>
  <name>selenium.grid.binary</name>
  <value>/usr/bin/chromedriver</value>
</property>
<property>
  <name>libselenium.page.load.delay</name>
  <value>3</value>
</property>
<property>
  <name>webdriver.chrome.driver</name>
  <value>/root/chromedriver</value>
  <description>The path to the ChromeDriver binary</description>
</property>
<property>
  <name>selenium.enable.headless</name>
  <value>true</value>
  <description>A Boolean value representing the headless option for Firefox
  and Chrome drivers</description>
</property>

</configuration>

When I tested the setup using this: 

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver  \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText  URL

With some of the problematic URLs, they all came out well on the console. There
were, however, quite a number of URLs identified as outlinks. But when I ran the
full crawl with this plugin, it appears to show some data in Solr, but I have
been unable to extract any data. It gives '0' as the count of what has been
crawled, for all the URLs. This is quite worrying, because without the plugin,
I did manage to get data from about half of the URLs. The performance is way
worse than it should be. I'm also confused because testing some of the sites
with the example I was given above works.
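
(For reference, a plain Solr query can show the document counts directly; the
core name below is an example.)

curl "http://localhost:8983/solr/nutch/select?q=*:*&rows=0"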

Below is a sample of the errors I got from the log files. Please have a look at 
them and let me know if there is a parameter I'm not setting properly:

2022-02-15 01:49:02,093 ERROR tika.TikaParser - Problem loading custom Tika 
configuration from tika-config.xml

java.lang.NumberFormatException: For input string: ""

2022-02-15 13:29:21,331 ERROR selenium.Http - Failed to get protocol output
java.lang.RuntimeException: org.openqa.selenium.WebDriverException: unknown 
error: net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=96.0.4664.110)
  
Caused by: org.openqa.selenium.WebDriverException: unknown error: 
net::ERR_NAME_NOT_RESOLVED
  (Session info: headless chrome=96.0.4664.110)
  
*** Element info: {Using=tag name, value=body}
2022-02-15 13:29:23,971 ERROR selenium.Http - Failed to get protocol output
java.lang.RuntimeException: org.openqa.selenium.NoSuchElementException: no such 
element: Unable to locate element: {"method":"css selector","selector":"body"}
  (Session info: headless chrome=96.0.4664.110)
For documentation on this error, please visit: 
http://seleniumhq.org/exceptions/no_such_element.html

2022-02-15 13:29:23,972 INFO  fetcher.FetcherThread - FetcherThread 71 fetch of 
http://ialab.com.ar/ failed with: java.lang.RuntimeException: 
org.openqa.selenium.NoSuchEle>  (Session info: headless chrome=96.0.4664.110)
For documentation on this error, please visit: 
http://seleniumhq.org/exceptions/no_such_element.html

2022-02-15 13:29:27,648 ERROR selenium.HttpWebClient - Selenium WebDriver: 
Timeout Exception: Capturing whatever loaded so far...

2022-02-15 13:32:42,713 INFO  regex.RegexURLNormalizer - can't find rules for 
scope 'outlink', using default

2022-02-15 13:33:23,664 INFO  regex.RegexURLNormalizer - can't find rules for 
scope 'fetcher', using default

2022-02-15 13:36:23,347 INFO  parse.ParserFactory - The parsing plugins: 
[org.apache.nutch.parse.tika.TikaParser] are enabled via the plugin.includes 
system property, and >2022-02-15 13:36:23,

RE: Nutch not crawling all URLs

2022-01-13 Thread Roseline Antai
Thank you Sebastian.

I will try these.

Kind regards,
Roseline



-Original Message-
From: Sebastian Nagel  
Sent: 13 January 2022 12:33
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system.
Normally, you run the browser in headless mode without a graphical device 
(monitor) attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:
  src/plugin/protocol-selenium/README.md
  
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether 
it works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver  \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText  URL


Caveat: because browsers are updated frequently, you may need to use a recent 
driver version and possibly also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.


> My use case is Text mining  and Machine Learning classification. I'm 
> indexing into Solr and then transferring the indexed data to MongoDB 
> for further processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to 
perform all steps. But I'd agree that browser-based crawling isn't that easy to 
set up with Nutch.

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you. I did enjoy the holiday. Hope you did too. 
> 
> I have had a look at the protocol-selenium plugin, but it was a bit difficult 
> to understand. It appears it only works with Firefox. Does it work at all 
> with Chrome? I was also not sure of what values to set for the properties. It 
> seems you need to have some form of GUI to run it?
> 
> Is there some documentation or tutorial on this? My guess is that some of the 
> pages might not be crawling because of JavaScript. I might be wrong, but 
> would want to test that.
> 
> I think would be quite good for my use case because I am trying to implement 
> broad crawling. 
> 
> My use case is Text mining  and Machine Learning classification. I'm indexing 
> into Solr and then transferring the indexed data to MongoDB for further 
> processing.
> 
> Kind regards,
> Roseline
> 
> 
> 
> 
> 
> -Original Message-----
> From: Sebastian Nagel 
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> the mail below went to my junk folder and I didn't see it.
> 
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays, but I want to emphasize that Nutch is a community 
> project and at times it might take a few days until somebody finds the time 
> to respond.
> 
>> Could you confirm if you received all the urls I sent?
> 
> I've tried a few of the URLs you sent but not all of them. And figuring out 
> why a site isn't crawled may take some time.
> 
>> Another question I have about Nutch is if it has problems with 
>> crawling javascript pages?
> 
> By default Nutch does not execute Javascript.
> 
> There is a protocol plugin (protocol-selenium) to fetch pages with a web 
> browser between Nutch and the crawled sites. This way Javascript pages can be 
> crawled for the price of some overhead in setting up the crawler and network 
> traffic to fetch the page dependencies (CSS, Javascript, images).
> 
>> I would ideally love to make the crawler work for my URLs than start 
>> checking for other crawlers and waste all the work so far.
> 
> Well, Nutch is for sure a good crawler. But as always: there are many other 
> crawlers which might be better adapted to a specific use case.
> 
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
> 
> Best,
> Sebastian
> 
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - 
>> https://eur02.safelinks.pro

Re: Nutch not crawling all URLs

2022-01-13 Thread Sebastian Nagel
Hi Roseline,

> Does it work at all with Chrome?

Yes.

> It seems you need to have some form of GUI to run it?

You need graphics libraries but not necessarily a graphical system.
Normally, you run the browser in headless mode without a graphical
device (monitor) attached.

> Is there some documentation or tutorial on this?

The README is probably the best documentation:
  src/plugin/protocol-selenium/README.md
  https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

After installing chromium and the Selenium chromedriver, you can test whether it
works by running:

bin/nutch parsechecker \
  -Dplugin.includes='protocol-selenium|parse-tika' \
  -Dselenium.grid.binary=/path/to/selenium/chromedriver  \
  -Dselenium.driver=chrome \
  -Dselenium.enable.headless=true \
  -followRedirects -dumpText  URL


Caveat: because browsers are updated frequently, you may need to use a recent
driver version and possibly also upgrade the Selenium dependencies in Nutch.
Let us know if you need help here.
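
A quick sanity check for version mismatches (the binary names depend on your
distribution):

chromium --version       # or: google-chrome --version
chromedriver --version   # the major versions should line up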


> My use case is Text mining  and Machine Learning classification. I'm indexing
> into Solr and then transferring the indexed data to MongoDB for further
> processing.

Well, that's not an untypical use case for Nutch. And it's a long pipeline:
fetching, HTML parsing, extracting content fields, indexing. Nutch is able to
perform all steps. But I'd agree that browser-based crawling isn't that easy
to set up with Nutch.

Best,
Sebastian

On 1/12/22 17:53, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you. I did enjoy the holiday. Hope you did too. 
> 
> I have had a look at the protocol-selenium plugin, but it was a bit difficult 
> to understand. It appears it only works with Firefox. Does it work at all 
> with Chrome? I was also not sure of what values to set for the properties. It 
> seems you need to have some form of GUI to run it?
> 
> Is there some documentation or tutorial on this? My guess is that some of the 
> pages might not be crawling because of JavaScript. I might be wrong, but 
> would want to test that.
> 
> I think would be quite good for my use case because I am trying to implement 
> broad crawling. 
> 
> My use case is Text mining  and Machine Learning classification. I'm indexing 
> into Solr and then transferring the indexed data to MongoDB for further 
> processing.
> 
> Kind regards,
> Roseline
> 
> 
> 
> 
> 
> -Original Message-----
> From: Sebastian Nagel  
> Sent: 12 January 2022 16:12
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> Hi Roseline,
> 
>> the mail below went to my junk folder and I didn't see it.
> 
> No problem. I hope you nevertheless enjoyed the holidays.
> And sorry for any delays, but I want to emphasize that Nutch is a community 
> project and at times it might take a few days until somebody finds the time 
> to respond.
> 
>> Could you confirm if you received all the urls I sent?
> 
> I've tried a few URLs you sent but not all of them. And figuring out the 
> reason why a site isn't crawled may take some time.
> 
>> Another question I have about Nutch is if it has problems with 
>> crawling javascript pages?
> 
> By default Nutch does not execute Javascript.
> 
> There is a protocol plugin (protocol-selenium) to fetch pages with a web 
> browser between Nutch and the crawled sites. This way Javascript pages can be 
> crawled for the price of some overhead in setting up the crawler and network 
> traffic to fetch the page dependencies (CSS, Javascript, images).
> 
>> I would ideally love to make the crawler work for my URLs rather than start 
>> checking for other crawlers and waste all the work so far.
> 
> Well, Nutch is for sure a good crawler. But as always: there are many other 
> crawlers which might be better adapted to a specific use case.
> 
> What's your use case? Indexing into Solr or Elasticsearch?
> Text mining? Archiving content?
> 
> Best,
> Sebastian
> 
> On 1/12/22 12:13, Roseline Antai wrote:
>> Hi Sebastian,
>>
>> For some reason, the mail below went to my junk folder and I didn't see it.
>>
>> The notco page - https://notco.com/  was not indexed, no. When I enabled 
>> redirects, I was able to get a few pages, but they don't seem valid.
>>
>> Could you confirm if you received all the urls I sent?

RE: Nutch not crawling all URLs

2022-01-12 Thread Roseline Antai
Hi Sebastian,

Thank you. I did enjoy the holiday. Hope you did too. 

I have had a look at the protocol-selenium plugin, but it was a bit difficult 
to understand. It appears it only works with Firefox. Does it work at all with 
Chrome? I was also not sure of what values to set for the properties. It seems 
you need to have some form of GUI to run it?

Is there some documentation or tutorial on this? My guess is that some of the 
pages might not be getting crawled because of JavaScript. I might be wrong, but 
would want to test that.

I think it would be quite good for my use case because I am trying to implement 
broad crawling. 

My use case is Text mining  and Machine Learning classification. I'm indexing 
into Solr and then transferring the indexed data to MongoDB for further 
processing.

Kind regards,
Roseline





-Original Message-
From: Sebastian Nagel  
Sent: 12 January 2022 16:12
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

Hi Roseline,

> the mail below went to my junk folder and I didn't see it.

No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays but I want to emphasize that Nutch is a community 
project and in doubt it might take a few days until somebody finds the time to 
respond.

> Could you confirm if you received all the urls I sent?

I've tried a few URLs you sent but not all of them. And figuring out the 
reason why a site isn't crawled may take some time.

> Another question I have about Nutch is if it has problems with 
> crawling javascript pages?

By default Nutch does not execute Javascript.

There is a protocol plugin (protocol-selenium) to fetch pages with a web 
browser between Nutch and the crawled sites. This way Javascript pages can be 
crawled for the price of some overhead in setting up the crawler and network 
traffic to fetch the page dependencies (CSS, Javascript, images).

> I would ideally love to make the crawler work for my URLs rather than start 
> checking for other crawlers and waste all the work so far.

Well, Nutch is for sure a good crawler. But as always: there are many other 
crawlers which might be better adapted to a specific use case.

What's your use case? Indexing into Solr or Elasticsearch?
Text mining? Archiving content?

Best,
Sebastian

On 1/12/22 12:13, Roseline Antai wrote:
> Hi Sebastian,
> 
> For some reason, the mail below went to my junk folder and I didn't see it.
> 
> The notco page - https://notco.com/  was not indexed, no. When I enabled 
> redirects, I was able to get a few 
> pages, but they don't seem valid.
> 
> Could you confirm if you received all the urls I sent?
> 
> Another question I have about Nutch is if it has problems with crawling 
> javascript pages?
> 
> I would ideally love to make the crawler work for my URLs than start checking 
> for other crawlers and waste all the work so far.
> 
> Just adding again, this is what my nutch-site.xml looks like:
> 
> <?xml version="1.0"?>
> <configuration>
> 
> <property>
>  <name>http.agent.name</name>
>  <value>Nutch Crawler</value>
> </property>
> <property>
>  <name>http.agent.email</name>
>  <value>datalake.ng at gmail d</value>
> </property>
> <property>
>  <name>db.ignore.internal.links</name>
>  <value>false</value>
> </property>
> <property>
>  <name>db.ignore.external.links</name>
>  <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
>   <name>parser.skip.truncated</name>
>   <value>false</value>
>   <description>Boolean value for whether we should skip parsing for
>   truncated documents. By default this property is activated due to
>   extremely high levels of CPU which parsing can sometimes take.
>   </description>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byHost</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> <property>
>   <name>http.timeout</name>
>   <value>5</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
> 
> </configuration>

Re: Nutch not crawling all URLs

2022-01-12 Thread Sebastian Nagel
Hi Roseline,

> the mail below went to my junk folder and I didn't see it.

No problem. I hope you nevertheless enjoyed the holidays.
And sorry for any delays, but I want to emphasize that Nutch is
a community project and it might take a few days
until somebody finds the time to respond.

> Could you confirm if you received all the urls I sent?

I've tried a few URLs you sent but not all of them. And figuring out the
reason why a site isn't crawled may take some time.

> Another question I have about Nutch is if it has problems with crawling
> javascript pages?

By default Nutch does not execute Javascript.

There is a protocol plugin (protocol-selenium) to fetch pages with a web
browser between Nutch and the crawled sites. This way Javascript pages
can be crawled for the price of some overhead in setting up the crawler and
network traffic to fetch the page dependencies (CSS, Javascript, images).

> I would ideally love to make the crawler work for my URLs rather than start checking
> for other crawlers and waste all the work so far.

Well, Nutch is for sure a good crawler. But as always: there are many
other crawlers which might be better adapted to a specific use case.

What's your use case? Indexing into Solr or Elasticsearch?
Text mining? Archiving content?

Best,
Sebastian

On 1/12/22 12:13, Roseline Antai wrote:
> Hi Sebastian,
> 
> For some reason, the mail below went to my junk folder and I didn't see it.
> 
> The notco page - https://notco.com/  was not indexed, no. When I enabled 
> redirects, I was able to get a few pages, but they don't seem valid.
> 
> Could you confirm if you received all the urls I sent?
> 
> Another question I have about Nutch is if it has problems with crawling 
> javascript pages?
> 
> I would ideally love to make the crawler work for my URLs rather than start checking 
> for other crawlers and waste all the work so far.
> 
> Just adding again, this is what my nutch-site.xml looks like:
> 
> <?xml version="1.0"?>
> <configuration>
> 
> <property>
>  <name>http.agent.name</name>
>  <value>Nutch Crawler</value>
> </property>
> <property>
>  <name>http.agent.email</name>
>  <value>datalake.ng at gmail d</value>
> </property>
> <property>
>  <name>db.ignore.internal.links</name>
>  <value>false</value>
> </property>
> <property>
>  <name>db.ignore.external.links</name>
>  <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
>   <name>parser.skip.truncated</name>
>   <value>false</value>
>   <description>Boolean value for whether we should skip parsing for
>   truncated documents. By default this property is activated due to
>   extremely high levels of CPU which parsing can sometimes take.
>   </description>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byHost</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> <property>
>   <name>http.timeout</name>
>   <value>5</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
> 
> </configuration>
> 
> 
> Regards,
> Roseline
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 13 December 2021 17:35
> To: user@nutch.apache.org
> Subject: Re: Nutch not crawling all URLs
> 
> CAUTION: This email originated outside the University. Check before clicking 
> links or attachments.
> 
> Hi Roseline,
> 
>> 5,36405,0,http://www.notco.com
> 
> What is the status for https://notco.com/ which is the final redirect 
> target?
> Is the target page indexed?
> 
> ~Sebastian
> 


RE: Nutch not crawling all URLs

2022-01-12 Thread Roseline Antai
Hi Sebastian,

For some reason, the mail below went to my junk folder and I didn't see it.

The notco page - https://notco.com/  was not indexed, no. When I enabled 
redirects, I was able to get a few pages, but they don't seem valid.

Could you confirm if you received all the urls I sent?

Another question I have about Nutch is if it has problems with crawling 
javascript pages?

I would ideally love to make the crawler work for my URLs rather than start checking 
for other crawlers and waste all the work so far.

Just adding again, this is what my nutch-site.xml looks like:

<?xml version="1.0"?>
<configuration>

<property>
 <name>http.agent.name</name>
 <value>Nutch Crawler</value>
</property>
<property>
 <name>http.agent.email</name>
 <value>datalake.ng at gmail d</value>
</property>
<property>
 <name>db.ignore.internal.links</name>
 <value>false</value>
</property>
<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for
  truncated documents. By default this property is activated due to
  extremely high levels of CPU which parsing can sometimes take.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byHost</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

</configuration>

Regards,
Roseline

-Original Message-
From: Sebastian Nagel  
Sent: 13 December 2021 17:35
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

CAUTION: This email originated outside the University. Check before clicking 
links or attachments.

Hi Roseline,

> 5,36405,0,http://www.notco.com

> What is the status for https://notco.com/ which is the final redirect 
> target?
Is the target page indexed?

~Sebastian


!! Join the #nutch Slack channel !!

2021-12-29 Thread lewis john mcgibbney
Hi user@, dev@,
I took the liberty of setting up a #nutch channel for our community to
communicate in a lower latency manner.
First join the-asf.slack.com Slack workspace
https://infra.apache.org/slack.html
Then simply join the #nutch channel.
See you there :)
Thanks
lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


RE: Nutch not crawling all URLs

2021-12-15 Thread Roseline Antai
Hi,

Following on from my previous enquiry, I was told to send the URLs I was trying 
to crawl so that they could be tried from your end. I sent these, but did not 
receive any confirmation of receipt. Can you please confirm that these have been 
received, and when I can look forward to getting some feedback?

I re-crawled the 20 URLs again and reset these values to the default values 
from the nutch-default.xml file:


<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>70</value>
</property>

I then set the ignore external links to false, as below:



<property>
  <name>db.ignore.external.links</name>
  <value>false</value>
</property>




I set the following property to 'true' still:





<property>
  <name>db.ignore.also.redirects</name>
  <value>true</value>
  <description>If true, the fetcher checks redirects the same way as
  links when ignoring internal or external links. Set to false to
  follow redirects despite the values for db.ignore.external.links and
  db.ignore.internal.links.
  </description>
</property>





13 URLs were fetched, but of these, the ones that were originally not fetched 
returned very few pages related to the domain in the URL, and this makes me 
question the crawl.



Also, when external links are not ignored, the crawler does go off onto 
different sites, like Wikipedia, news sites, etc. This is hardly efficient as 
it spends so long on the crawl fetching irrelevant pages. How can this be 
controlled in Nutch? If crawling up to 900 URLs as we are going to be doing, 
will we have to write regex expressions for each URL  in the regex-urlfilter in 
order to stick to the domains in the URL?
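
For instance, would it be entries like the following sketch (regex-urlfilter 
syntax; the two domains are just examples from our seed list)?

  # accept pages on the seed domains only
  +^https?://([a-z0-9-]+\.)*notco\.com/
  +^https?://([a-z0-9-]+\.)*ceibal\.edu\.uy/
  # reject everything else
  -.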



There is no explicit documentation on how to do this in Nutch, unless I have 
missed it?



Is there something that should be done that I'm not doing, or is Nutch just 
incapable of efficient crawling?



Regards,

Roseline



Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


From: Roseline Antai
Sent: 13 December 2021 12:02
To: 'user@nutch.apache.org' 
Subject: Nutch not crawling all URLs

Hi,

I am working with Apache nutch 1.18 and Solr. I have set up the system 
successfully, but I'm now having the problem that Nutch is refusing to crawl 
all the URLs. I am now at a loss as to what I should do to correct this 
problem. It fetches about half of the URLs in the seed.txt file.

For instance, when I inject 20 URLs, only 9 are fetched. I have made a number 
of changes based on the suggestions I saw on the Nutch forum, as well as on 
Stack overflow, but nothing seems to work.

This is what my nutch-site.xml file looks like:

<?xml version="1.0"?>
<configuration>

<property>
 <name>http.agent.name</name>
 <value>Nutch Crawler</value>
</property>
<property>
 <name>http.agent.email</name>
 <value>datalake.ng at gmail d</value>
</property>
<property>
 <name>db.ignore.internal.links</name>
 <value>false</value>
</property>
<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for
  truncated documents. By default this property is activated due to
  extremely high levels of CPU which parsing can sometimes take.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

</configuration>

Other changes I have made include changing the following in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>10</value>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>15</value>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>70</value>
</property>



I also commented out the line below in the regex-urlfilter file:


# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

Nothing seems to work.

What is it that I'm not doing, or doing wrongly here?

Regards,
Roseline

Dr Roseline Antai
Research

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi Sebastian,

yes, that's what I mean. Do you think there is a way to learn more about
how to crawl any website?

>Hi Ayhan,

>you mean?
>https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt



Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Ayhan,

you mean?
https://stackoverflow.com/questions/69352136/nutch-does-not-crawl-sites-that-allows-all-crawler-by-robots-txt

Sebastian

On 12/13/21 20:59, Ayhan Koyun wrote:
> Hi,
> 
> as I wrote before, it seems that I am not the only one who cannot crawl all 
> the seed.txt URLs. I couldn't really find a solution. I collected 450 domains 
> and there are approximately 200 that Nutch will not or cannot crawl. I want to 
> know why this happens; is there a solution to force crawling these sites?
> 
> It would be great to get a satisfying answer, to know why this happens and 
> maybe how to solve it.
> 
> Thanks in advance
> 
> Ayhan
> 
> 


RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis,

I got a really weird reply back from what I sent, so I thought it better to 
resend the URLs again. I'm unsure if you got the URLs in the first instance.

I've sent them as a text file attachment as well.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK


The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


-Original Message-
From: lewis john mcgibbney  
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

CAUTION: This email originated outside the University. Check before clicking 
links or attachments.

Hi Roseline,
Looks like you are ignoring external URLs... that could be the problem right 
there.
I encourage you to track counters on inject, generate and fetch phases to 
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we can 
try.
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02  wrote:

>
> user Digest 13 Dec 2021 12:02:41 - Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
> 34682 by: Roseline Antai
>
> Administrivia:
>
> -
> To post to the list, e-mail: user@nutch.apache.org To unsubscribe, 
> e-mail: user-digest-unsubscr...@nutch.apache.org
> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>
> --
>
>
>
>
> -- Forwarded message --
> From: Roseline Antai 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +0000
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system 
> successfully, but I'm now having the problem that Nutch is refusing to 
> crawl all the URLs. I am now at a loss as to what I should do to 
> correct this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
> number of changes based on the suggestions I saw on the Nutch forum, 
> as well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> <?xml version="1.0"?>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>Nutch Crawler</value>
> </property>
> <property>
> <name>http.agent.email</name>
> <value>datalake.ng at gmail d</value>
> </property>
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> </property>
> <property>
> <name>db.ignore.external.links</name>
> <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
> <name>parser.skip.truncated</name>
> <value>false</value>
> <description>Boolean value for whether we should skip parsing for
> truncated documents. By default this property is activated due to
> extremely high levels of CPU which parsing can sometimes take.
> </description>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>

Re: Nutch not crawling all URLs

2021-12-13 Thread Ayhan Koyun
Hi,

as I wrote before, it seems that I am not the only one who cannot crawl all 
the seed.txt URLs. I couldn't really find a solution. I collected 450 domains 
and there are approximately 200 that Nutch will not or cannot crawl. I want to 
know why this happens; is there a solution to force crawling these sites?

It would be great to get a satisfying answer, to know why this happens and 
maybe how to solve it.

Thanks in advance

Ayhan



Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi Roseline,

> 5,36405,0,http://www.notco.com

What is the status for https://notco.com/ which is the final redirect 
target?
Is the target page indexed?
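
One way to check, assuming the Solr core name from the Nutch tutorial:

  curl 'http://localhost:8983/solr/nutch/select?q=url:%22https%3A%2F%2Fnotco.com%2F%22'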

~Sebastian


RE: Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi Lewis,

Yes, there are public websites. Below are the 20 test URLs I've been trying to 
crawl.

http://traivefinance.com
http://www.ceibal.edu.uy
http://www.talovstudio.com
https://portaltelemedicina.com.br/en/telediagnostic-platform
http://www.notco.com
http://www.saiph.org
http://www.1doc3.com
http://www.amanda-care.com
http://www.unimadx.com
http://www.upch.edu.pe/bioinformatic/anemia/app/
http://www.u-planner.com
http://alerce.science
http://paraempleo.mtess.gov.py
http://layers.hemav.com
http://www.sisben.gov.co
http://ialab.com.ar
http://www.kilimo.com.ar
https://www.facebook.com/CIRSYS
http://www.dymaxionlabs.com
http://cedo.org


This is a count of the pages for the URLs crawled and not crawled. As can be 
seen, some are very large, while some are '0'.


,Project_id,Document Length,url
0,36400,0, http://www.trapview.com/v2/en/
1,36401,0,http://traivefinance.com
2,36402,2344075,http://www.ceibal.edu.uy
3,36403,35072,http://www.talovstudio.com
4,36404,1384658,https://portaltelemedicina.com.br/en/telediagnostic-platform
5,36405,0,http://www.notco.com
6,36406,0,http://www.saiph.org
7,36407,246009,http://www.1doc3.com
8,36408,43190,http://www.amanda-care.com
9,36409,0,http://www.unimadx.com
10,36410,0,http://www.upch.edu.pe/bioinformatic/anemia/app/
11,36411,0,http://www.u-planner.com
12,36412,8084,http://alerce.science
13,36413,0,http://paraempleo.mtess.gov.py
14,36414,0,http://layers.hemav.com
15,36415,0,http://www.sisben.gov.co
16,36416,3794113,http://ialab.com.ar
17,36417,0,http://www.kilimo.com.ar
18,36418,0,https://www.facebook.com/CIRSYS
19,36419,49062,http://www.dymaxionlabs.com
20,36420,1281267,http://cedo.org


Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK


The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.


-Original Message-
From: lewis john mcgibbney  
Sent: 13 December 2021 17:18
To: user@nutch.apache.org
Subject: Re: Nutch not crawling all URLs

CAUTION: This email originated outside the University. Check before clicking 
links or attachments.

Hi Roseline,
Looks like you are ignoring external URLs... that could be the problem right 
there.
I encourage you to track counters on inject, generate and fetch phases to 
understand where records may be being dropped.
Are the seeds you are using public? If so please post your seed file so we can 
try.
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02  wrote:

>
> user Digest 13 Dec 2021 12:02:41 - Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
> 34682 by: Roseline Antai
>
> Administrivia:
>
> -
> To post to the list, e-mail: user@nutch.apache.org To unsubscribe, 
> e-mail: user-digest-unsubscr...@nutch.apache.org
> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>
> --
>
>
>
>
> -- Forwarded message --
> From: Roseline Antai 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +0000
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system 
> successfully, but I'm now having the problem that Nutch is refusing to 
> crawl all the URLs. I am now at a loss as to what I should do to 
> correct this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
> number of changes based on the suggestions I saw on the Nutch forum, 
> as well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> <?xml version="1.0"?>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>Nutch Crawler</value>
> </property>
> <property>
> <name>http.agent.email</name>
> <value>datalake.ng at gmail d</value>
> </property>
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> </property>
> <property>
> <name>db.ignore.external.links</name>
> <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>

Re: Nutch not crawling all URLs

2021-12-13 Thread lewis john mcgibbney
Hi Roseline,
Looks like you are ignoring external URLs… that could be the problem right
there.
I encourage you to track counters on inject, generate and fetch phases to
understand where records may be being dropped.
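For example, a rough spot-check against logs/hadoop.log (the exact log
messages can differ between versions and log4j setups):

  grep 'Total new urls injected' logs/hadoop.log
  grep -c 'FetcherThread' logs/hadoop.log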
Are the seeds you are using public? If so please post your seed file so we
can try.
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02  wrote:

>
> user Digest 13 Dec 2021 12:02:41 - Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
> 34682 by: Roseline Antai
>
> Administrivia:
>
> -
> To post to the list, e-mail: user@nutch.apache.org
> To unsubscribe, e-mail: user-digest-unsubscr...@nutch.apache.org
> For additional commands, e-mail: user-digest-h...@nutch.apache.org
>
> --
>
>
>
>
> -- Forwarded message --
> From: Roseline Antai 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +0000
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system
> successfully, but I’m now having the problem that Nutch is refusing to
> crawl all the URLs. I am now at a loss as to what I should do to correct
> this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on the suggestions I saw on the Nutch forum, as
> well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> <?xml version="1.0"?>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>Nutch Crawler</value>
> </property>
> <property>
> <name>http.agent.email</name>
> <value>datalake.ng at gmail d</value>
> </property>
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> </property>
> <property>
> <name>db.ignore.external.links</name>
> <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
> <name>parser.skip.truncated</name>
> <value>false</value>
> <description>Boolean value for whether we should skip parsing for
> truncated documents. By default this property is activated due to
> extremely high levels of CPU which parsing can sometimes take.
> </description>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byDomain</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> <property>
>   <name>http.timeout</name>
>   <value>5</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> </configuration>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property>
>
> <property>
>   <name>ftp.timeout</name>
>   <value>10</value>
> </property>
>
> <property>
>   <name>ftp.server.timeout</name>
>   <value>15</value>
> </property>
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>65.0</value>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>25.0</value>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>70</value>
> </property>
>
> I also commented out the line below in the regex-urlfilter file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
>
>
> Nothing seems to work.
>
>
>
> What is it that I’m not doing, or doing wrongly here?
>
>
>
> Regards,
>
> Roseline
>
>
>
> *Dr Roseline Antai*
>
> *Research Fellow*
>
> Hunter Centre for Entrepreneurship
>
> Strathclyde Business School
>
> University of Strathclyde, Glasgow, UK
>
>
>
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
>
>
>
>
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Nagel
Hi,

(looping back to user@nutch - sorry, pressed the wrong reply button)

> Some URLs were denied by robots.txt,
> while a few failed with: Http code=403

That's two ways of signaling that these pages shouldn't be crawled;
HTTP 403 means "Forbidden".

> 3. I looked in CrawlDB and most URLs are in there, but were not
> crawled, so this is something that I find very confusing.

The CrawlDb also contains URLs which failed for various reasons.
That's important in order to avoid 404s, 403s etc. being retried
again and again.
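
For example, to see which status was recorded for one of the missing URLs
(the crawldb path is illustrative):

  bin/nutch readdb crawl/crawldb -url http://www.notco.com/
  bin/nutch readdb crawl/crawldb -stats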

> I also ran some of the URLs that were not crawled through this -
>  bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
>
> Some of the URLs that failed were parsed successfully, so I'm really
> confused as to why there are no results for them.
>

The "HTTP 403 Forbidden" could be from a "anti-bot protection" software.
If you run parsechecker at a different time or from a different machine,
and not repeatedly or too often it may succeed.

Best,
Sebastian

On 12/13/21 17:48, Roseline Antai wrote:
> Hi Sebastian,
> 
> Thank you for your reply.
> 
> 1. All URLs were injected, so 20 in total. None was rejected.
> 
> 2. I've had a look at the log files and I can see that some of the URLs could 
> not be fetched because the robots.txt file could not be found. Would this be a 
> reason why the fetch failed? Is there a way to work around it?
> 
> Some URLs were denied by robots.txt, while a few failed with: Http code=403 
> 
> 3. I looked in CrawlDB and most URLs are in there, but were not crawled, so 
> this is something that I find very confusing.
> 
> I also ran some of the URLs that were not crawled through this - bin/nutch 
> parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> Some of the URLs that failed were parsed successfully, so I'm really confused 
> as to why there are no results for them.
> 
> Do you have any suggestions on what I should try?
> 
> Dr Roseline Antai
> Research Fellow
> Hunter Centre for Entrepreneurship
> Strathclyde Business School
> University of Strathclyde, Glasgow, UK
> 
> 
> The University of Strathclyde is a charitable body, registered in Scotland, 
> number SC015263.
> 
> 
> -Original Message-
> From: Sebastian Nagel  
> Sent: 13 December 2021 12:19
> To: Roseline Antai 
> Subject: Re: Nutch not crawling all URLs
> 
> CAUTION: This email originated outside the University. Check before clicking 
> links or attachments.
> 
> Hi Roseline,
> 
>> For instance, when I inject 20 URLs, only 9 are fetched.
> 
> Are there any log messages about the 11 unfetched URLs in the log files? Try 
> to look for a file "hadoop.log"
> (usually in $NUTCH_HOME/logs/) and look
>  1. how many URLs have been injected.
> There should be a log message
>  ... Total new urls injected: ...
>  2. If all 20 URLs are injected, there should be log
> messages about these URLs from the fetcher:
>  FetcherThread ... fetching ...
> If the fetch fails, there might be a message about
> this.
>  3. Look into the CrawlDb for the missing URLs.
>   bin/nutch readdb .../crawldb -url 
> or
>   bin/nutch readdb .../crawldb -dump ...
> You get the command-line options by calling
>   bin/nutch readdb
> without any arguments
> 
> Alternatively, verify fetching and parsing the URLs by
>   bin/nutch parsechecker -followRedirects -checkRobotsTxt https://myUrl
> 
> 
>> <property>
>> <name>db.ignore.external.links</name>
>> <value>true</value>
>> </property>
> 
> Possibly you want to follow redirects anyway? See
> 
> <property>
>   <name>db.ignore.also.redirects</name>
>   <value>true</value>
>   <description>If true, the fetcher checks redirects the same way as
>   links when ignoring internal or external links. Set to false to
>   follow redirects despite the values for db.ignore.external.links and
>   db.ignore.internal.links.
>   </description>
> </property>
> 
> Best,
> Sebastian
> 
> 
> On 12/13/21 13:02, Roseline Antai wrote:
>> Hi,
>>
>>
>>
>> I am working with Apache nutch 1.18 and Solr. I have set up the system 
>> successfully, but I’m now having the problem that Nutch is refusing to 
>> crawl all the URLs. I am now at a loss as to what I should do to 
>> correct this problem. It fetches about half of the URLs in the seed.txt file.
>>
>>
>>
>> For instance, when I inject 20 URLs, only 9 are fetched. I have made a 
>> number of changes based on the suggestions I saw on the Nutch forum, 
>> as well as on Stack overflow, but nothing seems to work.
>>
>>
>>
>> This is what my nutch-site.xml file looks like:
>>
>>
>>
>>
>>
>>

Re: Nutch not crawling all URLs

2021-12-13 Thread Sebastian Greenholtz
I don't know how I joined this mailing list but please take me off of this
list, I have not used Nutch for a long time.

Thanks!

On Mon, Dec 13, 2021 at 7:03 AM Roseline Antai 
wrote:

> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system
> successfully, but I’m now having the problem that Nutch is refusing to
> crawl all the URLs. I am now at a loss as to what I should do to correct
> this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on the suggestions I saw on the Nutch forum, as
> well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> <?xml version="1.0"?>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>Nutch Crawler</value>
> </property>
> <property>
> <name>http.agent.email</name>
> <value>datalake.ng at gmail d</value>
> </property>
> <property>
> <name>db.ignore.internal.links</name>
> <value>false</value>
> </property>
> <property>
> <name>db.ignore.external.links</name>
> <value>true</value>
> </property>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
> <property>
> <name>parser.skip.truncated</name>
> <value>false</value>
> <description>Boolean value for whether we should skip parsing for
> truncated documents. By default this property is activated due to
> extremely high levels of CPU which parsing can sometimes take.
> </description>
> </property>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
>   will be processed for a page; otherwise, all outlinks will be processed.
>   </description>
> </property>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byDomain</value>
> </property>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
> <property>
>   <name>http.timeout</name>
>   <value>5</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> </configuration>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property>
>
> <property>
>   <name>ftp.timeout</name>
>   <value>10</value>
> </property>
>
> <property>
>   <name>ftp.server.timeout</name>
>   <value>15</value>
> </property>
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>65.0</value>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>25.0</value>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>70</value>
> </property>
>
> I also commented out the line below in the regex-urlfilter file:
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
>
>
> Nothing seems to work.
>
>
>
> What is it that I’m not doing, or doing wrongly here?
>
>
>
> Regards,
>
> Roseline
>
>
>
> *Dr Roseline Antai*
>
> *Research Fellow*
>
> Hunter Centre for Entrepreneurship
>
> Strathclyde Business School
>
> University of Strathclyde, Glasgow, UK
>
>
>
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
>
>
>
>
>


Nutch not crawling all URLs

2021-12-13 Thread Roseline Antai
Hi,

I am working with Apache nutch 1.18 and Solr. I have set up the system 
successfully, but I'm now having the problem that Nutch is refusing to crawl 
all the URLs. I am now at a loss as to what I should do to correct this 
problem. It fetches about half of the URLs in the seed.txt file.

For instance, when I inject 20 URLs, only 9 are fetched. I have made a number 
of changes based on the suggestions I saw on the Nutch forum, as well as on 
Stack overflow, but nothing seems to work.

This is what my nutch-site.xml file looks like:

<?xml version="1.0"?>
<configuration>

<property>
 <name>http.agent.name</name>
 <value>Nutch Crawler</value>
</property>
<property>
 <name>http.agent.email</name>
 <value>datalake.ng at gmail d</value>
</property>
<property>
 <name>db.ignore.internal.links</name>
 <value>false</value>
</property>
<property>
 <name>db.ignore.external.links</name>
 <value>true</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
</property>
<property>
  <name>parser.skip.truncated</name>
  <value>false</value>
  <description>Boolean value for whether we should skip parsing for
  truncated documents. By default this property is activated due to
  extremely high levels of CPU which parsing can sometimes take.
  </description>
</property>
<property>
  <name>db.max.outlinks.per.page</name>
  <value>-1</value>
  <description>The maximum number of outlinks that we'll process for a page.
  If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
  will be processed for a page; otherwise, all outlinks will be processed.
  </description>
</property>
<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content using the http://
  protocol, in bytes. If this value is nonnegative (>=0), content longer
  than it will be truncated; otherwise, no truncation at all. Do not
  confuse this setting with the file.content.limit setting.
  </description>
</property>
<property>
  <name>db.ignore.external.links.mode</name>
  <value>byDomain</value>
</property>
<property>
  <name>db.injector.overwrite</name>
  <value>true</value>
</property>
<property>
  <name>http.timeout</name>
  <value>5</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

</configuration>


Other changes I have made include changing the following in nutch-default.xml:

<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>

<property>
  <name>ftp.timeout</name>
  <value>10</value>
</property>

<property>
  <name>ftp.server.timeout</name>
  <value>15</value>
</property>

<property>
  <name>fetcher.server.delay</name>
  <value>65.0</value>
</property>

<property>
  <name>fetcher.server.min.delay</name>
  <value>25.0</value>
</property>

<property>
  <name>fetcher.max.crawl.delay</name>
  <value>70</value>
</property>



I also commented out the line below in the regex-urlfilter file:


# skip URLs containing certain characters as probable queries, etc.

-[?*!@=]

Nothing seems to work.

What is it that I'm not doing, or doing wrongly here?

Regards,
Roseline

Dr Roseline Antai
Research Fellow
Hunter Centre for Entrepreneurship
Strathclyde Business School
University of Strathclyde, Glasgow, UK

The University of Strathclyde is a charitable body, registered in Scotland, 
number SC015263.




Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-11-18 Thread Sebastian Nagel
The issue is now tracked in
  https://issues.apache.org/jira/browse/NUTCH-2907

On 10/28/21 15:31, Sebastian Nagel wrote:
> Hi Shi Wei,
> 
> sorry, but it looks like the Selenium protocol plugin has never been
> used with a proxy over https. There are two points which need (at a
> first glance) a rework:
> 
> 1. the protocol tries to establish a TLS/SSL connection to the proxy if
> the URL to be crawled is a https:// URL. There might be some proxies
> which can do this, but the proxies I'm aware of expect a HTTP CONNECT
> [1] for HTTPS proxying.
> 
> 2. probably also the browser / driver needs to be configured to
> use the same proxy. Afaics, this isn't done but is a requirement
> if the proxy is required for accessing web content. However, it
> might be possible by setting environment variables.
> 
> Sorry again. Feel free to open a Jira issue to get this fixed.
> 
> Best,
> Sebastian
> 
> [1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method
> 
> 
> On 10/28/21 11:45, sw.l...@quandatics.com wrote:
>> Hi there,
>>
>>  
>>
>> Good day!
>>
>>  
>>
>> We would like to crawl the web data by executing the Nutch with Selenium
>> plugin with the following command:
>>
>>  
>>
>> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http
>> https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
>>
>>  
>>
>> However, it failed with the following error message:
>>
>>  
>>
>> 2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = 
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list =
>> true
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 1
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch
>> Test/Nutch-1.18
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language =
>> en-us,en-gb,en;q=0.7,*;q=0.3
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept =
>> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
>>
>> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header =
>> true
>>
>> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
>>
>> javax.net.ssl.SSLHandshakeException: Remote host closed connection during
>> handshake
>>
>> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
>>
>> at sun.security.ssl.SSL
>>
>>  
>>
>> FYI, we have tried the following approaches but the issues persisted.
>>
>>  
>>
>> 1. Set the http.tls.certificates.check to false
>>
>> 2. Import the website's certificates to our java truststores
>>
>> 3. Our Nutch is configured with proxy
>>
>>  
>>
>> Kindly advise. Thanks in advance!
>>
>>  
>>
>>  
>>
>> Best Regards,
>>
>> Shi Wei
>>
>>  
>>
>>


Re: javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-10-28 Thread Sebastian Nagel
Hi Shi Wei,

sorry, but it looks like the Selenium protocol plugin has never been
used with a proxy over https. There are two points which need (at a
first glance) a rework:

1. the protocol tries to establish a TLS/SSL connection to the proxy if
the URL to be crawled is a https:// URL. There might be some proxies
which can do this, but the proxies I'm aware of expect a HTTP CONNECT
[1] for HTTPS proxying.

2. probably also the browser / driver needs to be configured to
use the same proxy. Afaics, this isn't done but is a requirement
if the proxy is required for accessing web content. However, it
might be possible by setting environment variables.
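
To make point 1 concrete: such proxies expect the client to first send
something like

  CONNECT crawled-host.example:443 HTTP/1.1

and then tunnel the encrypted connection (the host name here is just a
placeholder).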

Sorry again. Feel free to open a Jira issue to get this fixed.

Best,
Sebastian

[1] https://en.wikipedia.org/wiki/HTTP_tunnel#HTTP_CONNECT_method


On 10/28/21 11:45, sw.l...@quandatics.com wrote:
> Hi there,
> 
>  
> 
> Good day!
> 
>  
> 
> We would like to crawl the web data by executing the Nutch with Selenium
> plugin with the following command:
> 
>  
> 
> $ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http
> https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial
> 
>  
> 
> However, it failed with the following error message:
> 
>  
> 
> 2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = 
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list =
> true
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 1
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch
> Test/Nutch-1.18
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language =
> en-us,en-gb,en;q=0.7,*;q=0.3
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept =
> text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
> 
> 2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header =
> true
> 
> 2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output
> 
> javax.net.ssl.SSLHandshakeException: Remote host closed connection during
> handshake
> 
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)
> 
> at sun.security.ssl.SSL
> 
>  
> 
> FYI, we have tried the following approaches but the issues persisted.
> 
>  
> 
> 1. Set the http.tls.certificates.check to false
> 
> 2. Import the website's certificates to our java truststores
> 
> 3. Our Nutch is configured with proxy
> 
>  
> 
> Kindly advise. Thanks in advance!
> 
>  
> 
>  
> 
> Best Regards,
> 
> Shi Wei
> 
>  
> 
> 


javax.net.ssl.SSLHandshakeException Error when Executing Nutch with Selenium Plugin

2021-10-28 Thread sw.ling
Hi there,

 

Good day!

 

We would like to crawl the web data by executing the Nutch with Selenium
plugin with the following command:

 

$ nutch plugin protocol-selenium org.apache.nutch.protocol.selenium.Http
https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial

 

However, it failed with the following error message:

 

2021-10-26 19:07:53,961 INFO  selenium.Http - http.proxy.host = xxx.xx.xx.xx

2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.port = 

2021-10-26 19:07:53,962 INFO  selenium.Http - http.proxy.exception.list =
true

2021-10-26 19:07:53,962 INFO  selenium.Http - http.timeout = 1

2021-10-26 19:07:53,962 INFO  selenium.Http - http.content.limit = 1048576

2021-10-26 19:07:53,962 INFO  selenium.Http - http.agent = Apache Nutch
Test/Nutch-1.18

2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept.language =
en-us,en-gb,en;q=0.7,*;q=0.3

2021-10-26 19:07:53,962 INFO  selenium.Http - http.accept =
text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8

2021-10-26 19:07:53,962 INFO  selenium.Http - http.enable.cookie.header =
true

2021-10-26 19:07:54,114 ERROR selenium.Http - Failed to get protocol output

javax.net.ssl.SSLHandshakeException: Remote host closed connection during
handshake

at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:994)

at sun.security.ssl.SSL

 

FYI, we have tried the following approaches but the issues persisted.

 

1. Set the http.tls.certificates.check to false

2. Import the website's certificates to our java truststores

3. Our Nutch is configured with proxy

 

Kindly advise. Thanks in advance!

 

 

Best Regards,

Shi Wei

 



Re: Cant integrate the kerberos enabled solr cloud with nutch

2021-10-22 Thread Wei
Hi Sebastian, 

Thanks for your reply. 

We understand that the current indexer-solr plugin only supports basic 
authentication. We will open the Jira issue accordingly. Usually, how long 
does it take for such a Jira issue to be resolved?

Besides, is there any other workaround that we could try for the described 
issue? From our research over the past few days, it seems that somebody 
was able to connect to Solr with Kerberos authentication via the NTLM 
scheme of HttpAuthenticationSchemes. Could you assist in checking whether 
this would work?

https://cwiki.apache.org/confluence/plugins/servlet/mobile?contentId=115512111#content/view/115512111

Your Sincerely,
Shi Wei
> On 22 Oct 2021, at 6:55 PM, Sebastian Nagel 
>  wrote:
> 
> Hi Shi Wei,
> 
> > kerberos
> 
> sorry, I missed this detail. The plugin indexer-solr for now
> only supports basic authentication.
> 
> Could you open a Jira issue to get Kerberos authentication
> implemented on the Nutch site?
>  https://issues.apache.org/jira/projects/NUTCH
> 
> See also:
>  
> https://solr.apache.org/guide/8_5/kerberos-authentication-plugin.html#using-solrj-with-a-kerberized-solr
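> 
> A JAAS client configuration along the lines described there (a sketch only;
> the keytab path and principal are placeholders):
> 
>   Client {
>     com.sun.security.auth.module.Krb5LoginModule required
>     useKeyTab=true
>     keyTab="/path/to/client.keytab"
>     storeKey=true
>     useTicketCache=true
>     principal="solr-client@EXAMPLE.COM";
>   };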
> 
> Thanks,
> Sebastian
> 
> On 10/22/21 12:01 PM, sw.l...@quandatics.com wrote:
>> Hi Sebastian,
>> Here is the index-writers.xml  you requested. Thank
>> Your Sincerely,
>> Shi Wei
>> -Original Message-
>> From: Sebastian Nagel 
>> Sent: Friday, 22 October, 2021 5:46 PM
>> To: user@nutch.apache.org
>> Subject: Re: Cant integrate the kerberos enabled solr cloud with nutch
>> Hi Shi Wei,
>> could you also share the index writer configuration (conf/index-writers.xml)?
>> The default is unauthenticated access to Solr, see the snippet below.
>> The file httpclient-auth.xml is not relevant for the Solr indexer, it's used 
>> if a crawled web site requires authentication in order to fetch the content 
>> via the plugin protocol-httpclient.
>> Best,
>> Sebastian
>> <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
>>   <parameters>
>>     <param name="type" value="http"/>
>>     <param name="url" value="http://localhost:8983/solr/nutch"/>
>>     ...
>>   </parameters>
>> </writer>
>>
>>
>>
>>
>>
>>
>> On 10/22/21 10:10 AM, sw.l...@quandatics.com wrote:
>>> Hi,
>>> 
>>> We have encountered a problem which can’t integrate the kerberos enabled 
>>> solr cloud with nutch.
>>> 
>>> When execute "nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1
>>> -filter -normalize" command ,it will fail with "HTTP ERROR 401Problem 
>>> accessing /solr/admin/collections. Reason:Authentication required" but we 
>>> able to curl it with the keytab.
>>> 
>>> Version of Nutch :1.18
>>> 
>>> Your Sincerely,
>>> 
>>> Shi Wei
>>> 
> 



RE: Cant integrate the kerberos enabled solr cloud with nutch

2021-10-22 Thread sw.ling
Hi Sebastian,

Here is the index-writers.xml  you requested. Thank

Your Sincerely,
Shi Wei


-Original Message-
From: Sebastian Nagel  
Sent: Friday, 22 October, 2021 5:46 PM
To: user@nutch.apache.org
Subject: Re: Cant integrate the kerberos enabled solr cloud with nutch

Hi Shi Wei,

could you also share the index writer configuration (conf/index-writers.xml)?

The default is unauthenticated access to Solr, see the snippet below.
The file httpclient-auth.xml is not relevant for the Solr indexer, it's used if 
a crawled web site requires authentication in order to fetch the content via 
the plugin protocol-httpclient.

Best,
Sebastian

<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="solr"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <param name="commitSize" value="1000"/>
    <param name="auth" value="false"/>
    <param name="username" value="username"/>
    <param name="password" value="password"/>
  </parameters>
  ...


On 10/22/21 10:10 AM, sw.l...@quandatics.com wrote:
> Hi,
> 
> We have encountered a problem: we can't integrate a Kerberos-enabled Solr
> cloud with Nutch.
> 
> When we execute the "nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1
> -filter -normalize" command, it fails with "HTTP ERROR 401 Problem
> accessing /solr/admin/collections. Reason: Authentication required", but we
> are able to curl it with the keytab.
> 
> Version of Nutch :1.18
> 
> Your Sincerely,
> 
> Shi Wei
> 


<writers xmlns="http://lucene.apache.org/nutch"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://lucene.apache.org/nutch index-writers.xsd">

  <writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
    <parameters>
      <param name="type" value="solr"/>
      <param name="url" value="http://utility.quandatics.tech:8983/solr/"/>
      <!-- remaining parameters were stripped from the archived message -->
    </parameters>
    <mapping>
      <!-- mapping section was stripped from the archived message -->
    </mapping>
  </writer>

</writers>



Re: Cant integrate the kerberos enabled solr cloud with nutch

2021-10-22 Thread Sebastian Nagel

Hi Shi Wei,

could you also share the index writer configuration (conf/index-writers.xml)?

The default is unauthenticated access to Solr, see the snippet below.
The file httpclient-auth.xml is not relevant for the Solr indexer, it's
used if a crawled web site requires authentication in order to fetch
the content via the plugin protocol-httpclient.

Best,
Sebastian

<writer id="indexer_solr_1" class="org.apache.nutch.indexwriter.solr.SolrIndexWriter">
  <parameters>
    <param name="type" value="solr"/>
    <param name="url" value="http://localhost:8983/solr/nutch"/>
    <param name="commitSize" value="1000"/>
    <param name="auth" value="false"/>
    <param name="username" value="username"/>
    <param name="password" value="password"/>
  </parameters>
  ...

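To enable the (currently supported) basic authentication instead, the relevant parameters in conf/index-writers.xml would look roughly like this, a sketch assuming the default parameter names shipped with Nutch 1.18:

  <param name="auth" value="true"/>
  <param name="username" value="your-solr-user"/>
  <param name="password" value="your-solr-password"/>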

On 10/22/21 10:10 AM, sw.l...@quandatics.com wrote:

Hi,

We have encountered a problem: we can't integrate a Kerberos-enabled Solr
cloud with Nutch.

When we execute the "nutch index crawl/crawldb/ -linkdb crawl/linkdb/ $s1 -filter -normalize" command, it fails with "HTTP ERROR 401
Problem accessing /solr/admin/collections. Reason: Authentication required", but we are able to curl it with the keytab.


Version of Nutch :1.18

Your Sincerely,

Shi Wei
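
For reference, the "curl with the keytab" check mentioned above typically looks like the following shell sketch (the principal, keytab path, and host are placeholders, not values from the thread):

  # obtain a Kerberos ticket from the keytab
  kinit -kt /etc/security/keytabs/solr-client.keytab client@EXAMPLE.COM
  # SPNEGO-authenticated request against the Solr collections API
  curl --negotiate -u : "http://solr-host:8983/solr/admin/collections?action=LIST"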





Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-16 Thread Lewis John McGibbney
Hi Clark,
This is a lot of information... thank you for compiling it all.
Ideally the version of Hadoop being used with Nutch should ALWAYS match the
Hadoop binaries referenced in
https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run
into classpath issues.
I would like to encourage you to create a wiki page so we can document this in
a user-friendly way... would you be open to that?
You can create an account at
https://cwiki.apache.org/confluence/display/NUTCH/Home
Thanks for your consideration.
lewismc
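
A quick shell sketch for spotting such a mismatch (assuming NUTCH_HOME points at an installed Nutch and the hadoop command is on the PATH):

  ls $NUTCH_HOME/runtime/local/lib | grep '^hadoop-'   # Hadoop jars bundled with Nutch
  hadoop version                                       # version of the cluster binaries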

On 2021/07/14 18:27:23, Clark Benham wrote: 
> Hi All,
> 
> Sebastian helped fix my issue: using S3 as a backend I was able to get
> nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an
> oddity that nutch-1.19 shipped 11 Hadoop 3.1.3 jars, e.g.
> hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ... (this made running
> `hadoop version` report 3.1.3), so I replaced those 3.1.3 jars with the 3.3.0
> jars from the Hadoop download.
> Also, in the main nutch branch (
> https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently
> has dependencies on hadoop-3.1.3; e.g.
> 
> <dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3" conf="*->default">
>   <exclude org="hsqldb" name="hsqldb" />
>   <exclude org="net.sf.kosmosfs" name="kfs" />
>   <exclude org="net.java.dev.jets3t" name="jets3t" />
>   <exclude org="org.eclipse.jdt" name="core" />
>   <exclude org="org.mortbay.jetty" name="jsp-*" />
>   <exclude org="ant" name="ant" />
> </dependency>
> <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3" conf="*->default" />
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.1.3" conf="*->default" />
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />
> 
> 
> I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
> 
> I didn't change "mapreduce.job.dir" because there's no namenode nor
> datanode processes running when using hadoop with S3, so the UI is blank.
> 
> Copied from Email with Sebastian:
> >  > The plugin loader doesn't appear to be able to read from s3 in
> nutch-1.18
> >  > with hadoop-3.2.1[1].
> 
> > I had a look into the plugin loader: it can only read from the local file
> system.
> > But that's ok because the Nutch job file is copied to the local machine
> > and unpacked. Here is how the paths look on one of the running
> Common Crawl
> > task nodes:
> 
> The configs for the working hadoop are as follows:
> 
> core-site.xml
> 
> <configuration>
> 
>   <property>
>     <name>hadoop.tmp.dir</name>
>     <value>/home/hdoop/tmpdata</value>
>   </property>
> 
>   <property>
>     <name>fs.defaultFS</name>
>     <value>s3a://my-bucket</value>
>   </property>
> 
>   <property>
>     <name>fs.s3a.access.key</name>
>     <value>KEY_PLACEHOLDER</value>
>     <description>AWS access key ID.
>       Omit for IAM role-based or provider-based authentication.</description>
>   </property>
> 
>   <property>
>     <name>fs.s3a.secret.key</name>
>     <value>SECRET_PLACEHOLDER</value>
>     <description>AWS secret key.
>       Omit for IAM role-based or provider-based authentication.</description>
>   </property>
> 
>   <property>
>     <name>fs.s3a.aws.credentials.provider</name>
>     <description>
>       Comma-separated class names of credential provider classes which implement
>       com.amazonaws.auth.AWSCredentialsProvider.
> 
>       These are loaded and queried in sequence for a valid set of credentials.
>       Each listed class must implement one of the following means of
>       construction, which are attempted in order:
>       1. a public constructor accepting java.net.URI and
>          org.apache.hadoop.conf.Configuration,
>       2. a public static method named getInstance that accepts no
>          arguments and returns an instance of
>          com.amazonaws.auth.AWSCredentialsProvider, or
>       3. a public default constructor.
> 
>       Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
>       anonymous access to a publicly accessible S3 bucket without any credentials.
>       Please note that allowing anonymous access to an S3 bucket compromises
>       security and therefore is unsuitable for most use cases. It can be useful
>       for accessing public data sets without requiring AWS credentials.
> 
>       If unspecified, then the default list of credential provider classes,
>       queried in sequence, is:
>       1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
>          Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
>       2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
>          configuration of AWS access key ID and secret access key in
>          environment variables named AWS_ACCESS_KEY_ID and
>          AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
>       3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
>          of instance profile credentials if running in an EC2 VM.
>     </description>
>   </property>

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-15 Thread Sebastian Nagel

Hi Clark,

thanks for summarizing this discussion and sharing the final configuration!

Good to know that it's possible to run Nutch on Hadoop using S3A without
using HDFS (no namenode/datanodes running).

Best,
Sebastian


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-14 Thread Clark Benham
Hi All,

Sebastian helped fix my issue: using S3 as a backend I was able to get
nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an
oddity that nutch-1.19 shipped 11 Hadoop 3.1.3 jars, e.g.
hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ... (this made running
`hadoop version` report 3.1.3), so I replaced those 3.1.3 jars with the 3.3.0
jars from the Hadoop download.
Also, in the main nutch branch (
https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently
has dependencies on hadoop-3.1.3; e.g.

<dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3" conf="*->default">
  <exclude org="hsqldb" name="hsqldb" />
  <exclude org="net.sf.kosmosfs" name="kfs" />
  <exclude org="net.java.dev.jets3t" name="jets3t" />
  <exclude org="org.eclipse.jdt" name="core" />
  <exclude org="org.mortbay.jetty" name="jsp-*" />
  <exclude org="ant" name="ant" />
</dependency>
<dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core" rev="3.1.3" conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />

I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
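
In yarn-site.xml that corresponds to the following sketch of the property as described:

  <property>
    <name>yarn.nodemanager.local-dirs</name>
    <value>${hadoop.tmp.dir}/nm-local-dir</value>
  </property>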

I didn't change "mapreduce.job.dir" because there's no namenode nor
datanode processes running when using hadoop with S3, so the UI is blank.

Copied from Email with Sebastian:
>  > The plugin loader doesn't appear to be able to read from s3 in
nutch-1.18
>  > with hadoop-3.2.1[1].

> I had a look into the plugin loader: it can only read from the local file
system.
> But that's ok because the Nutch job file is copied to the local machine
> and unpacked. Here is how the paths look on one of the running
Common Crawl
> task nodes:

The configs for the working hadoop are as follows:

core-site.xml

<configuration>

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/home/hdoop/tmpdata</value>
  </property>

  <property>
    <name>fs.defaultFS</name>
    <value>s3a://my-bucket</value>
  </property>

  <property>
    <name>fs.s3a.access.key</name>
    <value>KEY_PLACEHOLDER</value>
    <description>AWS access key ID.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>SECRET_PLACEHOLDER</value>
    <description>AWS secret key.
      Omit for IAM role-based or provider-based authentication.</description>
  </property>

  <property>
    <name>fs.s3a.aws.credentials.provider</name>
    <description>
      Comma-separated class names of credential provider classes which implement
      com.amazonaws.auth.AWSCredentialsProvider.

      These are loaded and queried in sequence for a valid set of credentials.
      Each listed class must implement one of the following means of
      construction, which are attempted in order:
      1. a public constructor accepting java.net.URI and
         org.apache.hadoop.conf.Configuration,
      2. a public static method named getInstance that accepts no
         arguments and returns an instance of
         com.amazonaws.auth.AWSCredentialsProvider, or
      3. a public default constructor.

      Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
      anonymous access to a publicly accessible S3 bucket without any credentials.
      Please note that allowing anonymous access to an S3 bucket compromises
      security and therefore is unsuitable for most use cases. It can be useful
      for accessing public data sets without requiring AWS credentials.

      If unspecified, then the default list of credential provider classes,
      queried in sequence, is:
      1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
         Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
      2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
         configuration of AWS access key ID and secret access key in
         environment variables named AWS_ACCESS_KEY_ID and
         AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
      3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
         of instance profile credentials if running in an EC2 VM.
    </description>
  </property>

</configuration>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-client</artifactId>
  <version>${hadoop.version}</version>
</dependency>

<dependency>
  <groupId>org.apache.hadoop</groupId>
  <artifactId>hadoop-aws</artifactId>
  <version>${hadoop.version}</version>
</dependency>








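With this configuration in place, a quick smoke test of the S3A filesystem could look like the following shell sketch (bucket and paths are placeholders):

  hadoop fs -ls s3a://my-bucket/
  hadoop fs -put urls/seed.txt s3a://my-bucket/urls/seed.txt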
hadoop-env.sh

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements.  See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership.  The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Set Hadoop-specific environment variables here.

##
## THIS FILE ACTS AS THE MASTER FILE FOR ALL HADOOP PROJECTS.
## SETTINGS HERE WILL BE READ BY ALL HADOOP COMMANDS.  THEREFORE,
## ONE CAN USE THIS FILE TO SET YARN, HDFS, AND MAPREDUCE
## CONFIGURATION OPTIONS INSTEAD OF xxx-env.sh.
##
## Precedence rules:
##
## {yarn-env.sh|hdfs-env.sh} > hadoop-env.sh > hard-coded defaults
##
## {YARN_xyz|HDFS_xyz} > HADOOP_xyz > hard-coded defaults
##

# Many of the options here are built from the perspective that users
# may want to provide OVERWRITING values on the command line.
# For example:
#
#  JAVA_HOME=/usr/java/testing hdfs dfs -ls


Looking for testers - Nutch Dockerfile

2021-07-01 Thread Lewis John McGibbney
Hi user@,
Are you interested in the Nutch Dockerfile? If so, keep reading.
We are looking for some assistance to test proposed additions to the Nutch 
Dockerfile.
Essentially the changes would facilitate installing and running the Nutch REST 
server and/or the Nutch WebApp in addition to the Nutch server-side 
installation.
How to build and run is all documented in the accompanying README.
If you are interested, please see

https://github.com/apache/nutch/pull/691

...and comment in the thread.

Thanks
lewismc


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-17 Thread Clark Benham
Hi Sebastian,

NUTCH_HOME=~/nutch, i.e. the local filesystem. I am using a plain, pre-built
Hadoop.
There's no "mapreduce.job.dir" I can grep in Hadoop 3.2.1, 3.3.0, or
Nutch 1.18/1.19, but mapreduce.job.hdfs-servers defaults to
${fs.defaultFS}, so s3a://temp-crawler in our case.
The plugin loader doesn't appear to be able to read from S3 in nutch-1.18
with hadoop-3.2.1 [1].

Using java & javac 11 with a downloaded and untarred hadoop-3.3.0 and a
nutch-1.19 I built:
I can run a mapreduce job on S3; and a Nutch job on hdfs, but running nutch
on S3 still gives "URLNormalizer not found" with the plugin dir on the
local filesystem or on S3a.

How would you recommend I go about getting the plugin loader to read from
other file systems?

[1] I still get 'x point org.apache.nutch.net.URLNormalizer not found'
(same stack trace as in the previous email) with plugin.folders set to
s3a://temp-crawler/user/hdoop/nutch-plugins
in my nutch-site.xml, while `hadoop fs -ls
s3a://temp-crawler/user/hdoop/nutch-plugins` lists all the plugins as present.


For posterity:
I got hadoop-3.3.0 working with a S3 backend by:

cd ~/hadoop-3.3.0

cp ./share/hadoop/tools/lib/hadoop-aws-3.3.0.jar ./share/hadoop/common/lib

cp ./share/hadoop/tools/lib/aws-java-sdk-bundle-1.11.563.jar
./share/hadoop/common/lib
to solve "Class org.apache.hadoop.fs.s3a.commit.S3ACommitterFactory not
found" despite the class existing in
~/hadoop-3.3.0/share/hadoop/tools/lib/hadoop-aws-3.3.0.jar  checking it's
on the classpath with `hadoop classpath | tr ":" "\n"  | grep
share/hadoop/tools/lib/hadoop-aws-3.3.0.jar` as well as adding it to
hadoop-env.sh.
see
https://stackoverflow.com/questions/58415928/spark-s3-error-java-lang-classnotfoundexception-class-org-apache-hadoop-f
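
An alternative to copying the jars around on stock Hadoop 3.x is to enable the bundled S3A tooling via hadoop-env.sh (a sketch, assuming a vanilla tarball layout):

  # in $HADOOP_HOME/etc/hadoop/hadoop-env.sh
  export HADOOP_OPTIONAL_TOOLS="hadoop-aws"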

On Tue, Jun 15, 2021 at 2:01 AM Sebastian Nagel wrote:

>  > The local file system? Or hdfs:// or even s3:// resp. s3a://?
>
> Also important: the value of "mapreduce.job.dir" - it's usually
> on hdfs:// and I'm not sure whether the plugin loader is able to
> read from other filesystems. At least, I haven't tried.
>
>
> On 6/15/21 10:53 AM, Sebastian Nagel wrote:
> > Hi Clark,
> >
> > sorry, I should have read your mail to the end - you mentioned that
> > you downgraded Nutch to run with JDK 8.
> >
> > Could you share which filesystem NUTCH_HOME points to?
> > The local file system? Or hdfs:// or even s3:// resp. s3a://?
> >
> > Best,
> > Sebastian
> >
> >
> > On 6/15/21 10:24 AM, Clark Benham wrote:
> >> Hi,
> >>
> >>
> >> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
> >> backend/filesystem; however I get an error ‘URLNormalizer class not
> found’.
> >> I have edited nutch-site.xml so this plugin should be included:
> >>
> >> <property>
> >>   <name>plugin.includes</name>
> >>   <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
> >> </property>
> >>
> >>   and then built on both nodes (I only have 2 machines).  I’ve
> successfully
> >> run Nutch locally and in distributed mode using HDFS, and I’ve run a
> >> mapreduce job with S3 as hadoop’s file system.
> >>
> >>
> >> I thought it was possible Nutch is not reading nutch-site.xml, because I
> >> can resolve an error by setting the config through the CLI even though
> >> this duplicates nutch-site.xml.
> >>
> >> The command:
> >>
> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.fetcher.Fetcher
> >> crawl/crawldb crawl/segments`
> >>
> >> throws
> >>
> >> `java.lang.IllegalArgumentException: Fetcher: No agents listed in '
> >> http.agent.name' property`
> >>
> >> while if I pass a value in for http.agent.name with
> >> `-Dhttp.agent.name=myScrapper`,
> >> (making the command `hadoop jar
> >> $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.fetcher.Fetcher
> >> -Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an error
> >> about there being no input path, which makes sense as I haven’t been
> able
> >> to generate any segments.
> >>
> >>
> >>   However, this method of setting Nutch configs doesn't work for
> injecting
> >> URLs; eg:
> >>
> >> `hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
> >> org.apache.nutch.crawl.Injector

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

> The local file system? Or hdfs:// or even s3:// resp. s3a://?

Also important: the value of "mapreduce.job.dir" - it's usually
on hdfs:// and I'm not sure whether the plugin loader is able to
read from other filesystems. At least, I haven't tried.


On 6/15/21 10:53 AM, Sebastian Nagel wrote:

Hi Clark,

sorry, I should have read your mail to the end - you mentioned that
you downgraded Nutch to run with JDK 8.

Could you share which filesystem NUTCH_HOME points to?
The local file system? Or hdfs:// or even s3:// resp. s3a://?

Best,
Sebastian


On 6/15/21 10:24 AM, Clark Benham wrote:

Hi,


I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however I get an error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so this plugin should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

  and then built on both nodes (I only have 2 machines).  I’ve successfully
run Nutch locally and in distributed mode using HDFS, and I’ve run a
mapreduce job with S3 as hadoop’s file system.


I thought it was possible Nutch is not reading nutch-site.xml, because I
can resolve an error by setting the config through the CLI even though this
duplicates nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in '
http.agent.name' property`

while if I pass a value in for http.agent.name with
`-Dhttp.agent.name=myScrapper`,
(making the command `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
-Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an error
about there being no input path, which makes sense as I haven’t been able
to generate any segments.


  However, this method of setting Nutch configs doesn't work for injecting
URLs; eg:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
-Dplugin.includes=".*" crawl/crawldb urls`

fails with the same “URLNormalizer” not found.


I tried copying the plugin dir to S3 and setting
plugin.folders to be a path on S3 without success. (I expect
the plugin to be bundled with the .job so this step should be unnecessary)


The full stack trace for `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in
[jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

# Took out multiple INFO messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
attempt_1623740678244_0001_m_01_0, Status : FAILED

Error: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)

at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)


#This error repeats 6 times total, 3 times for each node


2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%

2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
failed with state FAILED due to: Task failed
task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14

Job Counters

Failed map tasks=7

Killed map tasks=1

Killed reduce tasks=1

Launched map tasks=8

Other local map tasks=6

Rack-local map tasks=2

Total time spent by all maps in occupied slots (ms)=63196

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=31598

Total vcore-milliseconds taken by all map tasks=31598

Total megabyte-milliseconds taken by all map tasks=8089088

Map-Reduce Framework

CPU time spent (ms)=0

Physical memory (bytes) snapshot=0

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

Hi Clark,

sorry, I should have read your mail to the end - you mentioned that
you downgraded Nutch to run with JDK 8.

Could you share which filesystem NUTCH_HOME points to?
The local file system? Or hdfs:// or even s3:// resp. s3a://?

Best,
Sebastian


On 6/15/21 10:24 AM, Clark Benham wrote:

Hi,


I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however I get an error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so this plugin should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

  and then built on both nodes (I only have 2 machines).  I’ve successfully
run Nutch locally and in distributed mode using HDFS, and I’ve run a
mapreduce job with S3 as hadoop’s file system.


I thought it was possible Nutch is not reading nutch-site.xml, because I
can resolve an error by setting the config through the CLI even though this
duplicates nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in '
http.agent.name' property`

while if I pass a value in for http.agent.name with
`-Dhttp.agent.name=myScrapper`,
(making the command `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
-Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an error
about there being no input path, which makes sense as I haven’t been able
to generate any segments.


  However, this method of setting Nutch configs doesn't work for injecting
URLs; eg:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
-Dplugin.includes=".*" crawl/crawldb urls`

fails with the same “URLNormalizer” not found.


I tried copying the plugin dir to S3 and setting
plugin.folders to be a path on S3 without success. (I expect
the plugin to be bundled with the .job so this step should be unnecessary)


The full stack trace for `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in
[jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

# Took out multiple INFO messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
attempt_1623740678244_0001_m_01_0, Status : FAILED

Error: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)

at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)


#This error repeats 6 times total, 3 times for each node


2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%

2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
failed with state FAILED due to: Task failed
task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14

Job Counters

Failed map tasks=7

Killed map tasks=1

Killed reduce tasks=1

Launched map tasks=8

Other local map tasks=6

Rack-local map tasks=2

Total time spent by all maps in occupied slots (ms)=63196

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=31598

Total vcore-milliseconds taken by all map tasks=31598

Total megabyte-milliseconds taken by all map tasks=8089088

Map-Reduce Framework

CPU time spent (ms)=0

Physical memory (bytes) snapshot=0

Virtual memory (bytes) snapshot=0

2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed,
job status: FAILED, reason: Task failed task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,562 ERROR craw

Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Sebastian Nagel

Hi Clark,

the class URLNormalizer is not in a plugin - it's part of Nutch core and defines the interface for URL normalizer plugins. Looks like 
there's something wrong fundamentally, not only with the plugins.


> I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3

Are you aware that Nutch 1.19 will require JDK 11? The recent Nutch
snapshots already do; see NUTCH-2857. Hadoop 3.2.1 does not support JDK 11;
you'd need to use 3.3.0. Is a plain vanilla Hadoop used, or a specific Hadoop
distribution (e.g. Cloudera, Amazon EMR)?


Note: the normal way to run Nutch is:
  $NUTCH_HOME/runtime/deploy/bin/nutch  ...
But in the end it will also call "hadoop jar apache-nutch-xyz.job ..."
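
For example, a sketch using the inject step from this thread (the wrapper script and job file names follow the Nutch distribution; paths are the ones used above):

  $NUTCH_HOME/runtime/deploy/bin/nutch inject crawl/crawldb urls
  # ...which boils down to roughly:
  hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job org.apache.nutch.crawl.Injector crawl/crawldb urls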

Best,
Sebastian

On 6/15/21 10:24 AM, Clark Benham wrote:

Hi,


I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however I get an error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so this plugin should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

  and then built on both nodes (I only have 2 machines).  I’ve successfully
run Nutch locally and in distributed mode using HDFS, and I’ve run a
mapreduce job with S3 as hadoop’s file system.


I thought it was possible Nutch is not reading nutch-site.xml, because I
can resolve an error by setting the config through the CLI even though this
duplicates nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in '
http.agent.name' property`

while if I pass a value in for http.agent.name with
`-Dhttp.agent.name=myScrapper`,
(making the command `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
-Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an error
about there being no input path, which makes sense as I haven’t been able
to generate any segments.


  However, this method of setting Nutch configs doesn't work for injecting
URLs; eg:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
-Dplugin.includes=".*" crawl/crawldb urls`

fails with the same “URLNormalizer” not found.


I tried copying the plugin dir to S3 and setting
plugin.folders to be a path on S3 without success. (I expect
the plugin to be bundled with the .job so this step should be unnecessary)


The full stack trace for `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in
[jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

# Took out multiple INFO messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
attempt_1623740678244_0001_m_01_0, Status : FAILED

Error: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)

at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)


#This error repeats 6 times total, 3 times for each node


2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%

2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
failed with state FAILED due to: Task failed
task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14

Job Counters

Failed map tasks=7

Killed map tasks=1

Killed reduce tasks=1

Launched map tasks=8

Other local map tasks=6

Rack-local map tasks=2

Total time spent by all maps in occupied slots (ms)=63196

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=31598

Total vcore-milliseconds taken

Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-06-15 Thread Clark Benham
Hi,


I am trying to run Nutch-1.19 on hadoop-3.2.1 with an S3
backend/filesystem; however I get an error ‘URLNormalizer class not found’.
I have edited nutch-site.xml so this plugin should be included:

<property>
  <name>plugin.includes</name>
  <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|mimetype-filter|urlnormalizer|urlnormalizer-basic|.*|nutch-extensionpoints</value>
</property>

 and then built on both nodes (I only have 2 machines).  I’ve successfully
run Nutch locally and in distributed mode using HDFS, and I’ve run a
mapreduce job with S3 as hadoop’s file system.


I thought it was possible Nutch is not reading nutch-site.xml, because I
can resolve an error by setting the config through the CLI even though this
duplicates nutch-site.xml.

The command:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
crawl/crawldb crawl/segments`

throws

`java.lang.IllegalArgumentException: Fetcher: No agents listed in '
http.agent.name' property`

while if I pass a value in for http.agent.name with
`-Dhttp.agent.name=myScrapper`,
(making the command `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.fetcher.Fetcher
-Dhttp.agent.name=clark crawl/crawldb crawl/segments`),  I get an error
about there being no input path, which makes sense as I haven’t been able
to generate any segments.
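
The persistent fix for the missing agent name is to set it in nutch-site.xml rather than on the command line; a sketch using the value from the command above:

  <property>
    <name>http.agent.name</name>
    <value>myScrapper</value>
  </property>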


 However, this method of setting Nutch configs doesn't work for injecting
URLs; eg:

`hadoop jar $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
-Dplugin.includes=".*" crawl/crawldb urls`

fails with the same “URLNormalizer” not found.


I tried copying the plugin dir to S3 and setting
plugin.folders to be a path on S3 without success. (I expect
the plugin to be bundled with the .job so this step should be unnecessary)
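
One way to verify that expectation is to look inside the job archive, which is a plain zip file (a sketch; the exact path of the plugins inside the archive is an assumption):

  unzip -l $NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job | grep -i plugins | head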


The full stack trace for `hadoop jar
$NUTCH_HOME/runtime/deploy/apache-nutch-1.19-SNAPSHOT.job
org.apache.nutch.crawl.Injector
crawl/crawldb urls`:

SLF4J: Class path contains multiple SLF4J bindings.

SLF4J: Found binding in
[jar:file:/home/hdoop/hadoop-3.2.1/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: Found binding in
[jar:file:/home/hdoop/apache-nutch-1.18/lib/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]

SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an
explanation.

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]

# Took out multiple INFO messages

2021-06-15 07:06:07,842 INFO mapreduce.Job: Task Id :
attempt_1623740678244_0001_m_01_0, Status : FAILED

Error: java.lang.RuntimeException: x point
org.apache.nutch.net.URLNormalizer not found.

at org.apache.nutch.net.URLNormalizers.<init>(URLNormalizers.java:145)

at org.apache.nutch.crawl.Injector$InjectMapper.setup(Injector.java:139)

at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:143)

at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:799)

at org.apache.hadoop.mapred.MapTask.run(MapTask.java:347)

at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:174)

at java.security.AccessController.doPrivileged(Native Method)

at javax.security.auth.Subject.doAs(Subject.java:422)

at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1730)

at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:168)


#This error repeats 6 times total, 3 times for each node


2021-06-15 07:06:26,035 INFO mapreduce.Job:  map 100% reduce 100%

2021-06-15 07:06:29,067 INFO mapreduce.Job: Job job_1623740678244_0001
failed with state FAILED due to: Task failed
task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,190 INFO mapreduce.Job: Counters: 14

Job Counters

Failed map tasks=7

Killed map tasks=1

Killed reduce tasks=1

Launched map tasks=8

Other local map tasks=6

Rack-local map tasks=2

Total time spent by all maps in occupied slots (ms)=63196

Total time spent by all reduces in occupied slots (ms)=0

Total time spent by all map tasks (ms)=31598

Total vcore-milliseconds taken by all map tasks=31598

Total megabyte-milliseconds taken by all map tasks=8089088

Map-Reduce Framework

CPU time spent (ms)=0

Physical memory (bytes) snapshot=0

Virtual memory (bytes) snapshot=0

2021-06-15 07:06:29,195 ERROR crawl.Injector: Injector job did not succeed,
job status: FAILED, reason: Task failed task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


2021-06-15 07:06:29,562 ERROR crawl.Injector: Injector:
java.lang.RuntimeException: Injector job did not succeed, job status:
FAILED, reason: Task failed task_1623740678244_0001_m_01

Job failed as tasks failed. failedMaps:1 failedReduces:0 killedMaps:0
killedReduces: 0


at org.apache.nutch.crawl.Injector.inject(Injecto

Re: About Nutch 1.x Rest API at port 8081

2021-06-14 Thread gokmen.yontem

Hi All,

I guess the solution was using 0.0.0.0 instead of localhost.

   command: '/root/nutch/bin/nutch startserver -port 8081 -host 0.0.0.0'

https://stackoverflow.com/a/67972504/4907821

Thanks for your help,
Gokmen
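
For completeness, the port also has to be published to the host in docker-compose.yml; a sketch (the service name and paths follow the repo linked below):

  nutch:
    ports:
      - '8081:8081'
    command: '/root/nutch/bin/nutch startserver -port 8081 -host 0.0.0.0'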


On 2021-06-13 12:37, gokmen.yontem wrote:

Hello all,

My last question turned out to be a Docker issue, which I understood
with great embarrassment. This time, I'm pretty sure it's a
similar issue :) But believe me, I put a lot of effort into this as
well.

This documentation
(https://cwiki.apache.org/confluence/display/NUTCH/Nutch+1.X+RESTAPI)
tells me that after I run the server with `bin/nutch startserver` the
API starts working on the 8081 port.

Within the docker container (I get inside with docker exec -it) I
can confirm that the API is up and running and responsive to my REST
calls, but I cannot reach it from outside the container.

So far I have tried visiting http://localhost:8081/admin in my browser
(I have also tried other variations like 127.0.0.1, 0.0.0.0, the docker
IP, the machine IP, etc.), and all of these URLs in Postman as well.

Additionally, I have added a frontend application to my docker
network, I make sure that Frontend, Solr, and Nutch are in the same
docker network, and I have tried to make rest calls to Solr and Nutch
services from my frontend application: Solr worked, Nutch didn't work.

I have also suspected that there might be something wrong with my
computer, or with the Windows 10 machine I'm using, so I tried all of
the things above on an AWS server. No luck.

Finally, I dug into the Docker world and tried a bunch of
things like exposing ports and different network types; that
didn't work either. But since I'm not having an issue with the Solr API,
I thought it would be right to bring this issue to the community.


Here's my repo:
https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
I asked this question on
https://stackoverflow.com/questions/67949442/apache-nutch-doesnt-expose-its-api.

Thanks for your help,
Gokmen


About Nutch 1.x Rest API at port 8081

2021-06-13 Thread gokmen.yontem

Hello all,

My last question turned out to be a Docker issue, which I understood
with great embarrassment. This time, I'm pretty sure it's a similar
issue :) But believe me, I put a lot of effort into this as well.


This documentation 
(https://cwiki.apache.org/confluence/display/NUTCH/Nutch+1.X+RESTAPI) 
tells me that after I run the server with `bin/nutch startserver` the 
API starts working on the 8081 port.


Within the docker container (I get inside with docker exec -it) I
can confirm that the API is up and running and responsive to my REST calls,
but I cannot reach it from outside the container.


So far I have tried visiting http://localhost:8081/admin in my browser
(I have also tried other variations like 127.0.0.1, 0.0.0.0, the docker IP,
the machine IP, etc.), and all of these URLs in Postman as well.


Additionally, I have added a frontend application to my docker network, 
I make sure that Frontend, Solr, and Nutch are in the same docker 
network, and I have tried to make rest calls to Solr and Nutch services 
from my frontend application: Solr worked, Nutch didn't work.


I have also suspected that there might be something wrong with my computer,
or with the Windows 10 machine I'm using, so I tried all of the things
above on an AWS server. No luck.


Finally, I dug into the Docker world and tried a bunch of things
like exposing ports and different network types; that didn't work
either. But since I'm not having an issue with the Solr API, I thought it
would be right to bring this issue to the community.



Here's my repo: 
https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
I asked this question on 
https://stackoverflow.com/questions/67949442/apache-nutch-doesnt-expose-its-api.


Thanks for your help,
Gokmen


Re: Apache Nutch help request for a school project :)

2021-06-10 Thread lewis john mcgibbney
:)

On Thu, Jun 10, 2021 at 7:18 AM gokmen.yontem wrote:

> Lewis, Sebastian
> I can’t thank you enough! Your help is much appreciated.
>
> Next time I'll follow your advice and use the mailing list, which I
> wasn't aware of that.
>
> Best wishes,
> Gorkem
>
>
> On 2021-06-07 20:08, lewis john mcgibbney wrote:
> > Yep Sebastian is absolutely correct. I sent you a pull request.
> >
> > https://github.com/gorkemyontem/nutch/pull/1
> > HTH
> > lewismc
> >
> > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney
> >  wrote:
> >
> >> I’ll have a look today. You can always use the mailing list as
> >> well. Feel free to post your questions there and we will help you
> >> out :)
> >>
> >> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem wrote:
> >>
> >>> Hi Lewis,
> >>> Sorry to bother you. I've been trying to configure Apache Nutch
> >>> for
> >>> almost 10 days now and I'm about to give up. I saw that you are
> >>> contributing to this project and I thought maybe you can help me.
> >>> This is how desperate I am :)
> >>>
> >>> Here's my repo if you have time:
> >>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
> >>> I'm trying to use docker images so there isn't much on the repo/
> >>>
> >>> This is my current error:
> >>>
> >>> nutch| Indexer: java.lang.RuntimeException: Indexing job did
> >>> not
> >>> succeed, job status:FAILED, reason: NA
> >>> nutch|  at
> >>> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
> >>> nutch|  at
> >>> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
> >>> nutch|  at
> >>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> >>> nutch|  at
> >>> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
> >>>
> >>> People say that schema.xml could be wrong, but I'm using the most
> >>> up to
> >>> date one from here
> >>>
> >>
> >
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
> >>>
> >>> Many many thanks!
> >>> Best wishes,
> >>> Gorkem
> >> --
> >>
> >> http://home.apache.org/~lewismc/
> >> http://people.apache.org/keys/committer/lewismc
> >
> > --
> >
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Apache Nutch help request for a school project :)

2021-06-07 Thread lewis john mcgibbney
Yep Sebastian is absolutely correct. I sent you a pull request.
https://github.com/gorkemyontem/nutch/pull/1
HTH
lewismc

On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney wrote:

> I’ll have a look today. You can always use the mailing list as well. Feel
> free to post your questions there and we will help you out :)
>
> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem wrote:
>
>> Hi Lewis,
>> Sorry to bother you. I've been trying to configure Apache Nutch for
>> almost 10 days now and I'm about to give up. I saw that you are
>> contributing to this project and I thought maybe you can help me.
>> This is how desperate I am :)
>>
>> Here's my repo if you have time:
>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
>> I'm trying to use docker images so there isn't much on the repo/
>>
>> This is my current error:
>>
>> nutch| Indexer: java.lang.RuntimeException: Indexing job did not
>> succeed, job status:FAILED, reason: NA
>> nutch|  at
>> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
>> nutch|  at
>> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
>> nutch|  at
>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>> nutch|  at
>> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
>>
>>
>> People say that schema.xml could be wrong, but I'm using the most up to
>> date one from here
>>
>> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
>>
>>
>> Many many thanks!
>> Best wishes,
>> Gorkem
>>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Apache Nutch help request for a school project :)

2021-06-07 Thread Sebastian Nagel

Hi Gorkem,

I haven't verified it by trying - but it may be that given your configuration
the Solr instance isn't reachable via
  http://localhost:8983/solr/nutch
Inside the Docker network, host names are the same as container names, that is
  http://solr:8983/solr/nutch
might work. Cf. the docker-compose networking documentation:
  https://docs.docker.com/compose/networking/

In your docker-compose.yaml there is:

services:
  solr:
    container_name: solr
    image: 'solr:8.5.2'
    ports:
      - '8983:8983'
    ...
  nutch:
    container_name: nutch
    ...
    command: '/root/nutch/bin/crawl -i -D solr.server.url=http://localhost:8983/solr/nutch -s urls crawl 1'

Please try to fix this, i.e. use the container host name in the Solr URL.

Important: you need to configure the Solr URL in the file 
conf/index-writers.xml unless you're using
Nutch 1.14 or below. See
  
https://cwiki.apache.org/confluence/display/NUTCH/NutchTutorial#NutchTutorial-SetupSolrforsearch

In any case it's important to be able to read the logs (stdout/stderr and the 
hadoop.log)! I know this
isn't trivial when using docker-compose but it will save you a lot of time when 
searching for errors.
If you need help here, please let us know. Best start a separate thread in the 
Nutch user mailing list.

Best,
Sebastian
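
In other words, the crawl command in docker-compose.yml would point at the Solr container by its service name; a sketch based on the compose file above:

  command: '/root/nutch/bin/crawl -i -D solr.server.url=http://solr:8983/solr/nutch -s urls crawl 1'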

On 6/7/21 3:18 PM, lewis john mcgibbney wrote:

I’ll have a look today. You can always use the mailing list as well. Feel
free to post your questions there and we will help you out :)

On Sun, Jun 6, 2021 at 12:43 gokmen.yontem wrote:


Hi Lewis,
Sorry to bother you. I've been trying to configure Apache Nutch for
almost 10 days now and I'm about to give up. I saw that you are
contributing to this project and I thought maybe you can help me.
This is how desperate I am :)

Here's my repo if you have time:
https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
I'm trying to use docker images so there isn't much on the repo/

This is my current error:

nutch| Indexer: java.lang.RuntimeException: Indexing job did not
succeed, job status:FAILED, reason: NA
nutch|  at
org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
nutch|  at
org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
nutch|  at
org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
nutch|  at
org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)


People say that schema.xml could be wrong, but I'm using the most up to
date one from here

https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml


Many many thanks!
Best wishes,
Gorkem





Re: Apache Nutch help request for a school project :)

2021-06-07 Thread lewis john mcgibbney
I’ll have a look today. You can always use the mailing list as well. Feel
free to post your questions there and we will help you out :)

On Sun, Jun 6, 2021 at 12:43 gokmen.yontem wrote:

> Hi Lewis,
> Sorry to bother you. I've been trying to configure Apache Nutch for
> almost 10 days now and I'm about to give up. I saw that you are
> contributing to this project and I thought maybe you can help me.
> This is how desperate I am :)
>
> Here's my repo if you have time:
> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
> I'm trying to use docker images so there isn't much on the repo/
>
> This is my current error:
>
> nutch| Indexer: java.lang.RuntimeException: Indexing job did not
> succeed, job status:FAILED, reason: NA
> nutch|  at
> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
> nutch|  at
> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
> nutch|  at
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> nutch|  at
> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
>
>
> People say that schema.xml could be wrong, but I'm using the most up to
> date one from here
>
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
>
>
> Many many thanks!
> Best wishes,
> Gorkem
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-04 Thread Sebastian Nagel

Hi Lewis, hi Markus,

> snappy compression, which is a massive improvement for large data shuffling
jobs

Yes, I can confirm this. Also: it's worth considering zstd for all data kept
longer. We use it for a 25-billion-record CrawlDB: it's almost as fast (both
compression and decompression) as snappy, and you get a compression ratio
which is not far from that of bzip2 (which is very slow).
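
For reference, a sketch of the corresponding mapred-site.xml settings (the property names and codec classes are standard Hadoop; whether they fit depends on your job setup):

  <property>
    <name>mapreduce.map.output.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.map.output.compress.codec</name>
    <value>org.apache.hadoop.io.compress.SnappyCodec</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress</name>
    <value>true</value>
  </property>
  <property>
    <name>mapreduce.output.fileoutputformat.compress.codec</name>
    <value>org.apache.hadoop.io.compress.ZStandardCodec</value>
  </property>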

> worker/task nodes were run on spot instances

we do the same. However, EMR is priced at 25% of the on-demand EC2
instance price. As spot prices are usually only 50-70% of the on-demand price,
the EMR fee would add a non-trivial part of the total cost.

> backup logic

Yes. We checkpoint the output of every step to S3 unless it runs less than
one hour.

> using Terraform or AWS CloudFormation

We use a shell script to bootstrap the instances.
Some parts have been added/rewritten using CloudFormation (the VPC setup).
The long-term plan is to use more templates and also bake a base machine
image to speed up the bootstrapping.

> ARM support

ARM instances (including spot instances) often offer a better price/CPU ratio.
We've already switched to ARM for a couple of services/tasks - the effort
is minimal: choose the right base image; installing and running Java or Python
workflows does not change. But I haven't tried Hadoop yet.

Thanks for sharing your experiences, I'll keep you updated about our decisions
and progress!

Best,
Sebastian


Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-03 Thread Nicholas Roberts
Does the Apache Bigtop project not meet the requirements of a free
distribution?
https://github.com/apache/bigtop

What is the status of that project?

