[ANNOUNCE] Apache Nutch 1.20 Release

2024-04-28 Thread lewis john mcgibbney
The Apache Nutch Project https://nutch.apache.org/download/

Please verify signatures using the KEYS file
https://raw.githubusercontent.com/apache/nutch/master/KEYS when downloading
the release.
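For example (file names below assume the 1.20 source artifact; adjust for
whichever archive you download):

  gpg --import KEYS
  gpg --verify apache-nutch-1.20-src.tar.gz.asc apache-nutch-1.20-src.tar.gz
  sha512sum -c apache-nutch-1.20-src.tar.gz.sha512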

This release includes more than 60 bug fixes and improvements; the full
list of changes can be seen in the Jira release report
https://s.apache.org/ovjf3

Thanks to everyone who contributed to this release!

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[RESULT] WAS Re: [VOTE] Apache Nutch 1.20 Release

2024-04-24 Thread lewis john mcgibbney
Hi user@ & dev@,
I’m glad to conclude the Nutch 1.20 release candidate VOTE thread with the
following results.

[5] +1 Release this package as Apache Nutch 1.20
snagel*
balakuntala*
blackice*
Joe Gilvary
lewismc*

[ ] -1 Do not release this package because…

*Nutch Project Management Committee-binding

The Nutch 1.20 release candidate has passed the community VOTE. I will
therefore promote this release candidate.

Thanks for voting, and to everyone who contributed to the Apache Nutch
1.20 release.

lewismc

On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.20 release is available at [0] where
> accompanying SHA512 and ASC signatures can also be found.
> Information on verifying releases can be found at [1].
>
> The release candidate comprises a .zip and tar.gz archive of the sources
> at [2] and complementary binary distributions. In addition, a staged maven
> repository is available at [3].
>
> The Nutch 1.20 release report is available at [4].
>
> Please vote on releasing this package as Apache Nutch 1.20. The vote is
> open for at least the next 72 hours and passes if a majority of at least
> three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.20.
>
> [ ] -1 Do not release this package because…
>
> Cheers,
> lewismc
> P.S. Here is my +1.
>
> [0] https://dist.apache.org/repos/dist/dev/nutch/1.20
> [1] http://nutch.apache.org/downloads.html#verify
> [2] https://github.com/apache/nutch/tree/release-1.20
> [3]
> https://repository.apache.org/content/repositories/orgapachenutch-1021/
> [4] https://s.apache.org/ovjf3
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Help posting question

2024-04-24 Thread Lewis John McGibbney
Hi Sheham,

On 2024/04/20 08:47:41 Sheham Izat wrote:

> The Fetcher job was aborted, does that still mean that it went through the
> entire list of seed urls?

Yes, it processed the entire generated segment, but the fetcher…

* hung on https://disneyland.disney.go.com/, https://api.onlyoffice.com/,  
https://www.adu.com/ and https://www.lowes.com/
* was denied by robots.txt for https://sourceforge.net/, 
https://onsclothing.com/, https://kinto-usa.com/, https://twitter.com/, 
https://www.linkedin.com/, etc.
* encountered problems processing some robots.txt files for 
https://twitter.com/, https://www.trustradius.com/
There may be some other issues encountered by the fetcher. 

This is not at all uncommon. The fetcher completed successfully after 7 
seconds. You can proceed with your crawl.
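If you want to double-check what happened to each URL, the segment reader
can list and dump the per-URL fetch status (the segment name below is
illustrative):

  bin/nutch readseg -list -dir crawl/segments
  bin/nutch readseg -dump crawl/segments/20240420123456 seg-dump \
      -nocontent -noparse -noparsedata -noparsetext

The resulting dump contains one CrawlDatum per URL with its fetch status.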

> 
> I will go through the mailing list questions.

If you need more assistance please let us know. You will find plenty of 
pointers on this mailing list archive though.

lewismc


Re: Help posting question

2024-04-19 Thread Lewis John McGibbney
Hi Sheham,

On 2024/04/19 15:18:01 Sheham Izat wrote:
> 
> My questions are:
> 
> 1) What do I need to do to get Nutch to continue working even if there are
> hung threads?

From what I can see in the log you provided, nothing is preventing Nutch from 
continuing to work. The Fetcher job finished successfully.

> 2) Is there a way to avoid having these hanging threads in the first place?

Several factors can lead to hung fetcher threads. Lots of questions have been 
asked on this mailing list relating to exactly this issue. I would encourage 
you to study some of the community responses and see if they assist you in a 
better understanding of the possible issues. You can filter questions in the 
mailing list search with the following criteria
* date range: more than 1 day ago
* body: hung

https://lists.apache.org/list.html?user@nutch.apache.org
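As a starting point, the timeout-related properties in nutch-site.xml are
worth reviewing; a sketch with illustrative values (the defaults and full
descriptions are in nutch-default.xml):

<property>
  <name>http.timeout</name>
  <value>10000</value>
</property>

<property>
  <name>fetcher.threads.timeout.divisor</name>
  <value>2</value>
</property>

Setting fetcher.timelimit.mins can also keep a fetch round from sitting on
hung threads indefinitely.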


Re: [VOTE] Apache Nutch 1.20 Release

2024-04-16 Thread lewis john mcgibbney
Hi user@, dev@,
Please consider reviewing the Nutch 1.20 release candidate. This is a
critical prerequisite for us making releases of software at the ASF.
Thank you
lewismc

On Tue, Apr 9, 2024 at 2:28 PM lewis john mcgibbney 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.20 release is available at [0] where
> accompanying SHA512 and ASC signatures can also be found.
> Information on verifying releases can be found at [1].
>
> The release candidate comprises a .zip and tar.gz archive of the sources
> at [2] and complementary binary distributions. In addition, a staged maven
> repository is available at [3].
>
> The Nutch 1.20 release report is available at [4].
>
> Please vote on releasing this package as Apache Nutch 1.20. The vote is
> open for at least the next 72 hours and passes if a majority of at least
> three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.20.
>
> [ ] -1 Do not release this package because…
>
> Cheers,
> lewismc
> P.S. Here is my +1.
>
> [0] https://dist.apache.org/repos/dist/dev/nutch/1.20
> [1] http://nutch.apache.org/downloads.html#verify
> [2] https://github.com/apache/nutch/tree/release-1.20
> [3]
> https://repository.apache.org/content/repositories/orgapachenutch-1021/
> [4] https://s.apache.org/ovjf3
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[VOTE] Apache Nutch 1.20 Release

2024-04-09 Thread lewis john mcgibbney
Hi Folks,

A first candidate for the Nutch 1.20 release is available at [0] where
accompanying SHA512 and ASC signatures can also be found.
Information on verifying releases can be found at [1].

The release candidate comprises a .zip and tar.gz archive of the sources at
[2] and complementary binary distributions. In addition, a staged maven
repository is available at [3].

The Nutch 1.20 release report is available at [4].

Please vote on releasing this package as Apache Nutch 1.20. The vote is
open for at least the next 72 hours and passes if a majority of at least
three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.20.

[ ] -1 Do not release this package because…

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.20
[1] http://nutch.apache.org/downloads.html#verify
[2] https://github.com/apache/nutch/tree/release-1.20
[3] https://repository.apache.org/content/repositories/orgapachenutch-1021/
[4] https://s.apache.org/ovjf3

--
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[GSoC 2024 PROPOSAL] Overhaul the legacy Nutch plugin framework and replace it with PF4J

2024-03-12 Thread lewis john mcgibbney
Hi user@ & dev@,

I decided to write up a GSoC’24 proposal and encourage interested
applicants to register their interest in the JIRA issue or else reach
out to the Nutch PMC over on d...@nutch.apache.org (please CC
lewi...@apache.org).

Title: Overhaul the legacy Nutch plugin framework and replace it with PF4J
JIRA: https://issues.apache.org/jira/browse/NUTCH-3034

Thanks in advance, and good luck to prospective GSoC applicants.

lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: [DISCUSS] Removing Any23 from Nutch?

2023-09-14 Thread lewis john mcgibbney
+1 Tim.


On Wed, Sep 13, 2023 at 16:50 

>
>
>
> -- Forwarded message --
> From: Tim Allison 
> To: user@nutch.apache.org, d...@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 13 Sep 2023 10:50:08 -0400
> Subject: [DISCUSS] Removing Any23 from Nutch?
> All,
>   I opened https://issues.apache.org/jira/browse/NUTCH-2998 a few weeks
> ago.  Any23 was moved to the attic in June. Unless there are objections, I
> propose removing it from Nutch before the next release.
>   Any objections?
>
>Best,
>
>Tim
>


Re: user Digest 8 Nov 2022 10:16:05 -0000 Issue 3169

2022-11-08 Thread lewis john mcgibbney
Hi Mike,

Yes, it is possible to extend the TLD list. In fact, when the TLD list was
compiled the author left a note explicitly stating that it may not be
complete.
https://github.com/apache/nutch/blob/master/conf/domain-suffixes.xml.template
Please submit a PR if you wish to make any changes or additions. You can
use the parser checker tool to validate your change before creating the PR.
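For example, a quick sanity check against a URL on the affected TLD
(illustrative invocation):

  bin/nutch parsechecker -dumpText "https://about.google/intl/en_FR/how-our-business-works/"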
Thanks
lewismc

On Tue, Nov 8, 2022 at 02:16  wrote:

>
> -- Forwarded message --
> From: Mike 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 8 Nov 2022 11:15:51 +0100
> Subject: Incomplete TLD List
> Hi!
> Some of the new TLDs are wrongly indexed by Nutch; is it possible to extend
> the TLD list?
>
> "url":"https://about.google/intl/en_FR/how-our-business-works/;,
> "tstamp":"2022-11-06T17:22:14.808Z",
> "domain":"google",
> "digest":"3b9a23d42f200392d12a697bbb8d4d87",
>
>
> Thanks
>
> Mike
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Unable to fetch data from segment folder

2022-01-11 Thread Lewis John McGibbney
I created  https://issues.apache.org/jira/browse/NUTCH-2931 to track all of 
this work.
If you are interested in working on any of this it would be great to 
collaborate.
There is much more we can do over and above the few tickets I created.
lewismc

On 2021/12/24 10:07:20 sw.l...@quandatics.com wrote:
> Hi, 
> 
>  
> 
> We are currently facing a problem when using NUTCH Rest API. We try to run
> Nutch API through Postman and It works perfectly fine if we don't define the
> segment pathway. This is the command we run in Postman.
> 
>  
> 
> Inject
> 
> {
>   "type": "INJECT",
>   "confId": "default",
>   "crawlId": "crawl01",
>   "args": {
>     "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
>     "crawldb": "/tmp/crawl/crawldb"
>   }
> }
> 
> Generate
> 
> {
>   "type": "GENERATE",
>   "confId": "default",
>   "crawlId": "crawl01",
>   "args": {
>     "crawldb": "/tmp/crawl/crawldb",
>     "segment_dir": "/tmp/crawl/segments"
>   }
> }
> 
> Fetch
> 
> {
>   "type": "FETCH",
>   "confId": "default",
>   "crawlId": "crawl01",
>   "args": { "segment": "/tmp/crawl/segments" }
> }
> 
>  
> 
> We try to define the pathway to store the crawled data in a specific
> directory. However, when it comes to the fetch part, it cannot retrieve data
> from a specific folder (a folder name that is generated from the current
> date and time) under the segments folder. We have tried /tmp/crawl/segments/*
> and it can successfully retrieve the data, but it will also generate a new
> folder called *.
> 
> Therefore, may we know if there is any way to define the folder name in the
> segments folder, or is there another way to change the output directory?
> 
>  
> 
> Attached is our log for your reference. Kindly advise. Thanks in advance.
> 
>  
> 
> Best Regards,
> 
> Shi Wei
> 
>  
> 
> 


Re: Unable to fetch data from segment folder

2022-01-11 Thread Lewis John McGibbney
Hi Shi Wei,
I missed this thread over the holidays!
Which version of Nutch are you using?
The REST API needs quite a bit of attention. It is not a particularly mature 
aspect of the Nutch codebase and there is a catalog of issues which need to 
be addressed.
If you are interested in learning about these issues then we can create an EPIC 
issue in JIRA and then begin fleshing out all of the things that are wrong.
lewismc

On 2021/12/24 10:07:20 sw.l...@quandatics.com wrote:
> Hi, 
> 
>  
> 
> We are currently facing a problem when using NUTCH Rest API. We try to run
> Nutch API through Postman and It works perfectly fine if we don't define the
> segment pathway. This is the command we run in Postman.
> 
>  
> 
> Inject
> 
> {
>   "type": "INJECT",
>   "confId": "default",
>   "crawlId": "crawl01",
>   "args": {
>     "url_dir": "/opt/apache-nutch-1.18/runtime/local/urls/seed.txt",
>     "crawldb": "/tmp/crawl/crawldb"
>   }
> }
> 
> Generate
> 
> {
>   "type": "GENERATE",
>   "confId": "default",
>   "crawlId": "crawl01",
>   "args": {
>     "crawldb": "/tmp/crawl/crawldb",
>     "segment_dir": "/tmp/crawl/segments"
>   }
> }
> 
> Fetch
> 
> {
>   "type": "FETCH",
>   "confId": "default",
>   "crawlId": "crawl01",
>   "args": { "segment": "/tmp/crawl/segments" }
> }
> 
>  
> 
> We try to define the pathway to store the crawled data in a specific
> directory. However, when it comes to the fetch part, it cannot retrieve data
> from a specific folder (a folder name that is generated from the current
> date and time) under the segments folder. We have tried /tmp/crawl/segments/*
> and it can successfully retrieve the data, but it will also generate a new
> folder called *.
> 
> Therefore, may we know if there is any way to define the folder name in the
> segments folder, or is there another way to change the output directory?
> 
>  
> 
> Attached is our log for your reference. Kindly advise. Thanks in advance.
> 
>  
> 
> Best Regards,
> 
> Shi Wei
> 
>  
> 
> 


!! Join the #nutch Slack channel !!

2021-12-29 Thread lewis john mcgibbney
Hi user@, dev@,
I took the liberty of setting up a #nutch channel for our community to
communicate in a lower latency manner.
First join the-asf.slack.com Slack workspace
https://infra.apache.org/slack.html
Then simply join the #nutch channel.
See you there :)
Thanks
lewismc

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Nutch not crawling all URLs

2021-12-13 Thread lewis john mcgibbney
Hi Roseline,
Looks like you are ignoring external URLs… that could be the problem right
there.
I encourage you to track counters on the inject, generate and fetch phases to
understand where records are being dropped.
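For example, you can print the CrawlDb status counts after each step (the
path is illustrative):

  bin/nutch readdb crawl/crawldb -stats

The stats report counts such as db_unfetched, db_fetched and db_gone, which
makes it easy to see after which phase URLs disappear.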
Are the seeds you are using public? If so please post your seed file so we
can try.
Thank you
lewismc

On Mon, Dec 13, 2021 at 04:02  wrote:

>
> user Digest 13 Dec 2021 12:02:41 - Issue 3132
>
> Topics (messages 34682 through 34682)
>
> Nutch not crawling all URLs
> 34682 by: Roseline Antai
>
>
>
>
>
> -- Forwarded message --
> From: Roseline Antai 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 13 Dec 2021 12:02:26 +
> Subject: Nutch not crawling all URLs
>
> Hi,
>
>
>
> I am working with Apache nutch 1.18 and Solr. I have set up the system
> successfully, but I’m now having the problem that Nutch is refusing to
> crawl all the URLs. I am now at a loss as to what I should do to correct
> this problem. It fetches about half of the URLs in the seed.txt file.
>
>
>
> For instance, when I inject 20 URLs, only 9 are fetched. I have made a
> number of changes based on the suggestions I saw on the Nutch forum, as
> well as on Stack overflow, but nothing seems to work.
>
>
>
> This is what my nutch-site.xml file looks like:
>
>
>
>
>
> <configuration>
>
> <property>
>   <name>http.agent.name</name>
>   <value>Nutch Crawler</value>
> </property>
>
> <property>
>   <name>http.agent.email</name>
>   <value>datalake.ng at gmail d</value>
> </property>
>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>false</value>
> </property>
>
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)|language-identifier</value>
> </property>
>
> <property>
>   <name>parser.skip.truncated</name>
>   <value>false</value>
>   <description>Boolean value for whether we should skip parsing for
>   truncated documents. By default this property is activated due to
>   extremely high levels of CPU which parsing can sometimes take.</description>
> </property>
>
> <property>
>   <name>db.max.outlinks.per.page</name>
>   <value>-1</value>
>   <description>The maximum number of outlinks that we'll process for a page.
>   If this value is nonnegative (>=0), at most db.max.outlinks.per.page
>   outlinks will be processed for a page; otherwise, all outlinks will be
>   processed.</description>
> </property>
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.</description>
> </property>
>
> <property>
>   <name>db.ignore.external.links.mode</name>
>   <value>byDomain</value>
> </property>
>
> <property>
>   <name>db.injector.overwrite</name>
>   <value>true</value>
> </property>
>
> <property>
>   <name>http.timeout</name>
>   <value>5</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
>
> </configuration>
>
>
>
> Other changes I have made include changing the following in
> nutch-default.xml:
>
>
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later
>   fetching.</description>
> </property>
>
> <property>
>   <name>ftp.timeout</name>
>   <value>10</value>
> </property>
>
> <property>
>   <name>ftp.server.timeout</name>
>   <value>15</value>
> </property>
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>65.0</value>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>25.0</value>
> </property>
>
> <property>
>   <name>fetcher.max.crawl.delay</name>
>   <value>70</value>
> </property>
>
>
>
> I also commented out the line below in the regex-urlfilter file:
>
>
>
> # skip URLs containing certain characters as probable queries, etc.
> -[?*!@=]
>
>
>
> Nothing seems to work.
>
>
>
> What is it that I’m not doing, or doing wrongly here?
>
>
>
> Regards,
>
> Roseline
>
>
>
> *Dr Roseline Antai*
>
> *Research Fellow*
>
> Hunter Centre for Entrepreneurship
>
> Strathclyde Business School
>
> University of Strathclyde, Glasgow, UK
>
>
>
>
> The University of Strathclyde is a charitable body, registered in
> Scotland, number SC015263.
>
>
>
>
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Running Nutch on Hadoop with S3 filesystem; 'URLNormlizer not found'

2021-07-16 Thread Lewis John McGibbney
Hi Clark,
This is a lot of information... thank you for compiling it all.
Ideally the version of Hadoop being used with Nutch should ALWAYS match the 
hadoop binaries referenced in 
https://github.com/apache/nutch/blob/master/ivy/ivy.xml. This way you won't run 
into classpath issues.
I would like to encourage you to create a wiki page so we can document this in 
a user-friendly way... would you be open to that?
You can create an account at 
https://cwiki.apache.org/confluence/display/NUTCH/Home
Thanks for your consideration.
lewismc

On 2021/07/14 18:27:23, Clark Benham  wrote: 
> Hi All,
> 
> Sebastian helped fix my issue: using S3 as a backend I was able to get
> nutch-1.19 working with pre-built hadoop-3.3.0 and Java 11. There was an
> oddity: nutch-1.19 shipped 11 hadoop 3.1.3 jars, e.g.
> hadoop-hdfs-3.1.3.jar, hadoop-yarn-api-3.1.3.jar, ...; this made running
> `hadoop version` report 3.1.3, so I replaced those 3.1.3 jars with the 3.3.0
> jars from the hadoop download.
> Also, in the main nutch branch (
> https://github.com/apache/nutch/blob/master/ivy/ivy.xml) ivy.xml currently
> has dependencies on hadoop-3.1.3; eg.
> <dependency org="org.apache.hadoop" name="hadoop-common" rev="3.1.3"
>   conf="*->default">
>   ...
> </dependency>
> <dependency org="org.apache.hadoop" name="hadoop-hdfs" rev="3.1.3"
>   conf="*->default" />
> <dependency org="org.apache.hadoop" name="hadoop-mapreduce-client-core"
>   rev="3.1.3" conf="*->default" />
> <dependency org="org.apache.hadoop"
>   name="hadoop-mapreduce-client-jobclient" rev="3.1.3" conf="*->default" />
> 
> 
> I set yarn.nodemanager.local-dirs to '${hadoop.tmp.dir}/nm-local-dir'.
> 
> I didn't change "mapreduce.job.dir" because there's no namenode nor
> datanode processes running when using hadoop with S3, so the UI is blank.
> 
> Copied from Email with Sebastian:
> >  > The plugin loader doesn't appear to be able to read from s3 in
> nutch-1.18
> >  > with hadoop-3.2.1[1].
> 
> > I had a look into the plugin loader: it can only read from the local file
> system.
> > But that's ok because the Nutch job file is copied to the local machine
> > and unpacked. Here the paths how it looks like on one of the running
> Common Crawl
> > task nodes:
> 
> The configs for the working hadoop are as follows:
> 
> core-site.xml
> 
> <configuration>
>
> <property>
>   <name>hadoop.tmp.dir</name>
>   <value>/home/hdoop/tmpdata</value>
> </property>
>
> <property>
>   <name>fs.defaultFS</name>
>   <value>s3a://my-bucket</value>
> </property>
>
> <property>
>   <name>fs.s3a.access.key</name>
>   <value>KEY_PLACEHOLDER</value>
>   <description>AWS access key ID.
>    Omit for IAM role-based or provider-based authentication.</description>
> </property>
>
> <property>
>   <name>fs.s3a.secret.key</name>
>   <value>SECRET_PLACEHOLDER</value>
>   <description>AWS secret key.
>    Omit for IAM role-based or provider-based authentication.</description>
> </property>
>
> <property>
>   <name>fs.s3a.aws.credentials.provider</name>
>   <value></value>
>   <description>
>     Comma-separated class names of credential provider classes which
>     implement com.amazonaws.auth.AWSCredentialsProvider.
>
>     These are loaded and queried in sequence for a valid set of credentials.
>     Each listed class must implement one of the following means of
>     construction, which are attempted in order:
>     1. a public constructor accepting java.net.URI and
>        org.apache.hadoop.conf.Configuration,
>     2. a public static method named getInstance that accepts no
>        arguments and returns an instance of
>        com.amazonaws.auth.AWSCredentialsProvider, or
>     3. a public default constructor.
>
>     Specifying org.apache.hadoop.fs.s3a.AnonymousAWSCredentialsProvider allows
>     anonymous access to a publicly accessible S3 bucket without any
>     credentials. Please note that allowing anonymous access to an S3 bucket
>     compromises security and therefore is unsuitable for most use cases. It
>     can be useful for accessing public data sets without requiring AWS
>     credentials.
>
>     If unspecified, then the default list of credential provider classes,
>     queried in sequence, is:
>     1. org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider:
>        Uses the values of fs.s3a.access.key and fs.s3a.secret.key.
>     2. com.amazonaws.auth.EnvironmentVariableCredentialsProvider: supports
>        configuration of AWS access key ID and secret access key in
>        environment variables named AWS_ACCESS_KEY_ID and
>        AWS_SECRET_ACCESS_KEY, as documented in the AWS SDK.
>     3. com.amazonaws.auth.InstanceProfileCredentialsProvider: supports use
>        of instance profile credentials if running in an EC2 VM.
>   </description>
> </property>
>
> </configuration>
> 
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-client</artifactId>
>   <version>${hadoop.version}</version>
> </dependency>
>
> <dependency>
>   <groupId>org.apache.hadoop</groupId>
>   <artifactId>hadoop-aws</artifactId>
>   <version>${hadoop.version}</version>
> </dependency>
> 
> 
> 
> 
> 
> 
> 
> 
> hadoop-env.sh
> 
> #
> 
> # Licensed to the Apache Software Foundation (ASF) under one
> 
> # or more contributor license agreements.  See the NOTICE file
> 
> # distributed with this work for additional information
> 
> # regarding copyright ownership.  The ASF licenses this file
> 
> # to you under the Apache License, Version 2.0 (the
> 
> # "License"); you may not use this file except in compliance

Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-08 Thread Lewis John McGibbney
OK, I'm going to try out Selenium Grid 4 and record my experience in a wiki 
page.
I'll write back here in due course.
Thanks

On 2021/07/08 17:11:56, Abhay Ratnaparkhi  wrote: 
> Hello Lewis,
> 
> Sorry for the late reply, I missed your email.
> The version we used is 3.141.59. As I mentioned earlier, we moved to using
> puppeteer instead of selenium.
> 
> 
> Thank you
> ~Abhay
> 
> 
> Below was the hub configuration.
> 
> 
> ```
> hub:
>   image: "selenium/hub"
>   tag: "3.141.59"
>   port: 
>   servicePort: 
>   readinessTimeout: 40
>   readinessDelay: 40
>   livenessTimeout: 160
>   javaOpts: "-Xmx8192m"
>   resources:
>     limits:
>       cpu: "7"
>       memory: "9Gi"
>   gridNewSessionWaitTimeout: -1
>   gridJettyMaxThreads: 750
>   gridNodePolling: 1
>   gridCleanUpCycle: 5000
>   gridTimeout: 360
>   gridBrowserTimeout: 120
>   gridMaxSession: 5
>   gridUnregisterIfStillDownAfter: 60
> chrome:
>   enabled: true
>   image: "selenium/node-chrome"
>   tag: "3.141.59"
>   replicas: 60
>   nodeMaxSession: 5
>   nodeRegistryCycle: 5000
>   javaOpts: "-Xmx2048m"
>   resources:
>     limits:
>       cpu: "1200m"
>       memory: "3000Mi"
> ```
> 
> On Thu, Jul 1, 2021 at 3:06 PM Lewis John McGibbney 
> wrote:
> 
> > Hi Abhay,
> >
> > On 2021/06/10 22:27:42, Abhay Ratnaparkhi 
> > wrote:
> >
> > >
> > > Based on selenium I created a microservice (which handles all required
> > SSO
> > > redirections/ OTP handlings etc) and hosted that with a selenium grid in
> > > the kubernetes cluster for scaling.
> > > I found that we couldn't scale this approach beyond a certain point and
> > the
> > > selenium hub in the selenium grid can not be scaled horizontally.
> >
> > Which version of Selenium Grid and Hub did you use?
> > I haven't used either for a while... I did see that Grid 4 is available
> > https://www.selenium.dev/documentation/en/grid/grid_4/
> >
> > lewismc
> >
> 


Looking for testers - Nutch Dockerfile

2021-07-01 Thread Lewis John McGibbney
Hi user@,
Are you interested in the Nutch Dockerfile? If so, keep reading.
We are looking for some assistance to test proposed additions to the Nutch 
Dockerfile.
Essentially the changes would facilitate installing and running the Nutch REST 
server and/or the Nutch WebApp in addition to the Nutch server-side 
installation.
How to build and run is all documented in the accompanying README.
If you are interested, please see 

https://github.com/apache/nutch/pull/691 

..and comment in the thread.

Thanks
lewismc


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-07-01 Thread Lewis John McGibbney
Hi Abhay,

On 2021/06/10 22:27:42, Abhay Ratnaparkhi  wrote: 

> 
> Based on selenium I created a microservice (which handles all required SSO
> redirections/ OTP handlings etc) and hosted that with a selenium grid in
> the kubernetes cluster for scaling.
> I found that we couldn't scale this approach beyond a certain point and the
> selenium hub in the selenium grid can not be scaled horizontally.

Which version of Selenium Grid and Hub did you use?
I haven't used either for a while... I did see that Grid 4 is available
https://www.selenium.dev/documentation/en/grid/grid_4/

lewismc


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-12 Thread lewis john mcgibbney
Yes you are hitting the exact same problems that we did. This presents a
major persistent challenge for using Nutch across the enterprise as it
quite frankly doesn’t scale.
I’m going to take next week to have a look into this specific issue and see
what I can come up with.
By any chance are you able to share your K8s configuration management here?
Are you using Helm?
Are you running Nutch in K8s or via some other deployment?
Next week I’m also looking into building out a CloudFormation template for
Nutch on EMR with Ranger included, and will donate this to the Nutch
project.

On Sat, Jun 12, 2021 at 17:36  wrote:

>
> user Digest 13 Jun 2021 00:36:36 - Issue 3108
>
> Topics (messages 34633 through 34634)
>
> Re: Apache Nutch help request for a school project :)
>     34633 by: lewis john mcgibbney
>
> Re: Crawling pages behind SSO authentication (SAML/OIDC)
> 34634 by: Abhay Ratnaparkhi
>
>
>
>
>
> -- Forwarded message --
> From: lewis john mcgibbney 
> To: "gokmen.yontem" 
> Cc: Sebastian Nagel , user@nutch.apache.org
> Bcc:
> Date: Thu, 10 Jun 2021 09:53:31 -0700
> Subject: Re: Apache Nutch help request for a school project :)
> :)
>
> On Thu, Jun 10, 2021 at 7:18 AM gokmen.yontem 
> wrote:
>
> > Lewis, Sebastian
> > I can’t thank you enough! Your help is much appreciated.
> >
> > Next time I'll follow your advice and use the mailing list, which I
> > wasn't aware of that.
> >
> > Best wishes,
> > Gorkem
> >
> >
> > On 2021-06-07 20:08, lewis john mcgibbney wrote:
> > > Yep Sebastian is absolutely correct. I sent you a pull request.
> > >
> > > https://github.com/gorkemyontem/nutch/pull/1
> > > HTH
> > > lewismc
> > >
> > > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney
> > >  wrote:
> > >
> > >> I’ll have a look today. You can always use the mailing list as
> > >> well. Feel free to post your questions there and we will help you
> > >> out :)
> > >>
> > >> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem
> > >>  wrote:
> > >>
> > >>> Hi Lewis,
> > >>> Sorry to bother you. I've been trying to configure Apache Nutch
> > >>> for
> > >>> almost 10 days now and I'm about to give up. I saw that you are
> > >>> contributing to this project and I thought maybe you can help me.
> > >>> This is how desperate I am :)
> > >>>
> > >>> Here's my repo if you have time:
> > >>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
> > >>> I'm trying to use docker images so there isn't much on the repo/
> > >>>
> > >>> This is my current error:
> > >>>
> > >>> nutch| Indexer: java.lang.RuntimeException: Indexing job did
> > >>> not
> > >>> succeed, job status:FAILED, reason: NA
> > >>> nutch|  at
> > >>> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
> > >>> nutch|  at
> > >>> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
> > >>> nutch|  at
> > >>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> > >>> nutch|  at
> > >>> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
> > >>>
> > >>> People say that schema.xml could be wrong, but I'm using the most
> > >>> up to
> > >>> date one from here
> > >>>
> > >>
> > >
> >
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
> > >>>
> > >>> Many many thanks!
> > >>> Best wishes,
> > >>> Gorkem
> > >> --
> > >>
> > >> http://home.apache.org/~lewismc/
> > >> http://people.apache.org/keys/committer/lewismc
> > >
> > > --
> > >
> > > http://home.apache.org/~lewismc/
> > > http://people.apache.org/keys/committer/lewismc
> >
>
>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/k

Re: Apache Nutch help request for a school project :)

2021-06-10 Thread lewis john mcgibbney
:)

On Thu, Jun 10, 2021 at 7:18 AM gokmen.yontem 
wrote:

> Lewis, Sebastian
> I can’t thank you enough! Your help is much appreciated.
>
> Next time I'll follow your advice and use the mailing list, which I
> wasn't aware of that.
>
> Best wishes,
> Gorkem
>
>
> On 2021-06-07 20:08, lewis john mcgibbney wrote:
> > Yep Sebastian is absolutely correct. I sent you a pull request.
> >
> > https://github.com/gorkemyontem/nutch/pull/1
> > HTH
> > lewismc
> >
> > On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney
> >  wrote:
> >
> >> I’ll have a look today. You can always use the mailing list as
> >> well. Feel free to post your questions there and we will help you
> >> out :)
> >>
> >> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem
> >>  wrote:
> >>
> >>> Hi Lewis,
> >>> Sorry to bother you. I've been trying to configure Apache Nutch
> >>> for
> >>> almost 10 days now and I'm about to give up. I saw that you are
> >>> contributing to this project and I thought maybe you can help me.
> >>> This is how desperate I am :)
> >>>
> >>> Here's my repo if you have time:
> >>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
> >>> I'm trying to use docker images so there isn't much on the repo/
> >>>
> >>> This is my current error:
> >>>
> >>> nutch| Indexer: java.lang.RuntimeException: Indexing job did
> >>> not
> >>> succeed, job status:FAILED, reason: NA
> >>> nutch|  at
> >>> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
> >>> nutch|  at
> >>> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
> >>> nutch|  at
> >>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> >>> nutch|  at
> >>> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
> >>>
> >>> People say that schema.xml could be wrong, but I'm using the most
> >>> up to
> >>> date one from here
> >>>
> >>
> >
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
> >>>
> >>> Many many thanks!
> >>> Best wishes,
> >>> Gorkem
> >> --
> >>
> >> http://home.apache.org/~lewismc/
> >> http://people.apache.org/keys/committer/lewismc
> >
> > --
> >
> > http://home.apache.org/~lewismc/
> > http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Crawling pages behind SSO authentication (SAML/OIDC)

2021-06-09 Thread Lewis John McGibbney
Hi Abhay,

This is a problem space we looked at a while ago and made quite a bit of 
progress on.

Firstly, the protocol-httpclient plugin has been considered in a deprecated 
state for a while.
https://github.com/apache/nutch/tree/master/src/plugin/protocol-httpclient
I'm pretty sure that it will NOT cater for your use case. More information on 
the functionality and limits of this plugin can be found at 
https://cwiki.apache.org/confluence/display/nutch/HttpAuthenticationSchemes 
some more recent initiatives can be found at 
https://cwiki.apache.org/confluence/display/nutch/HttpPostAuthentication

Now, some of the plugins which may be used/adapted for your use case include 

1. https://github.com/apache/nutch/tree/master/src/plugin/protocol-htmlunit - 
customizable through 
https://github.com/apache/nutch/blob/master/src/plugin/lib-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/HtmlUnitWebDriver.java
 

2. both
https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium
https://github.com/apache/nutch/tree/master/src/plugin/protocol-interactiveselenium
some documentation exists at 
https://cwiki.apache.org/confluence/display/NUTCH/AdvancedAjaxInteraction
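
For experimentation, swapping the protocol plugin in via nutch-site.xml would
look roughly like this (the value is illustrative; keep whichever other
plugins your crawl needs):

<property>
  <name>plugin.includes</name>
  <value>protocol-selenium|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>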

Admittedly, I've not tried to run these plugins against a modern SSO site 
recently. I suspect that some dependency updates would not go amiss, so please 
take that into consideration.

Your note regarding the time it takes for the 'chaining' of systems together to 
achieve the login is well made. This was easily observed and needs a more 
consolidated/calculated approach IMHO.

I would be interested to discuss this further with you...

hth
lewismc

On 2021/06/07 02:45:54, Abhay Ratnaparkhi  wrote: 
> Hello,
> 
> We are using Nutch to crawl intranet pages behind SSO authentication.
> 
> I would like to know if anyone has used/updated httpclient protocol plugin
> for crawling pages behind SSO authentication.
> 
> The SSO auth redirects pages to the SSO server for login and optionally
> asks for second factor authentication like TOTP.
> 
> We have been using a custom plugin (which calls a nodejs service) which
> uses Google Puppeteer to drive a Chromium browser to do this login and OTP
> handling. This is much slower and might not be required, as many of these
> pages are rendered on the server side (so dynamic rendering isn't required).
> 
> Thank you
> Abhay Ratnaparkhi
> 


Re: Apache Nutch help request for a school project :)

2021-06-07 Thread lewis john mcgibbney
Yep Sebastian is absolutely correct. I sent you a pull request.
https://github.com/gorkemyontem/nutch/pull/1
HTH
lewismc

On Mon, Jun 7, 2021 at 6:18 AM lewis john mcgibbney 
wrote:

> I’ll have a look today. You can always use the mailing list as well. Feel
> free to post your questions there and we will help you out :)
>
> On Sun, Jun 6, 2021 at 12:43 gokmen.yontem 
> wrote:
>
>> Hi Lewis,
>> Sorry to bother you. I've been trying to configure Apache Nutch for
>> almost 10 days now and I'm about to give up. I saw that you are
>> contributing to this project and I thought maybe you can help me.
>> This is how desperate I am :)
>>
>> Here's my repo if you have time:
>> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
>> I'm trying to use docker images so there isn't much on the repo/
>>
>> This is my current error:
>>
>> nutch| Indexer: java.lang.RuntimeException: Indexing job did not
>> succeed, job status:FAILED, reason: NA
>> nutch|  at
>> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
>> nutch|  at
>> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
>> nutch|  at
>> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
>> nutch|  at
>> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
>>
>>
>> People say that schema.xml could be wrong, but I'm using the most up to
>> date one from here
>>
>> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
>>
>>
>> Many many thanks!
>> Best wishes,
>> Gorkem
>>
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Apache Nutch help request for a school project :)

2021-06-07 Thread lewis john mcgibbney
I’ll have a look today. You can always use the mailing list as well. Feel
free to post your questions there and we will help you out :)

On Sun, Jun 6, 2021 at 12:43 gokmen.yontem 
wrote:

> Hi Lewis,
> Sorry to bother you. I've been trying to configure Apache Nutch for
> almost 10 days now and I'm about to give up. I saw that you are
> contributing to this project and I thought maybe you can help me.
> This is how desperate I am :)
>
> Here's my repo if you have time:
> https://github.com/gorkemyontem/nutch/blob/main/docker-compose.yml
> I'm trying to use docker images so there isn't much on the repo/
>
> This is my current error:
>
> nutch| Indexer: java.lang.RuntimeException: Indexing job did not
> succeed, job status:FAILED, reason: NA
> nutch|  at
> org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:150)
> nutch|  at
> org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:291)
> nutch|  at
> org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
> nutch|  at
> org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:300)
>
>
> People say that schema.xml could be wrong, but I'm using the most up to
> date one from here
>
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/schema.xml
>
>
> Many many thanks!
> Best wishes,
> Gorkem
>
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


DuplexWeb-Google - GoogleBot Crawler For Duplex / Google Assistant

2021-06-03 Thread lewis john mcgibbney
Some interesting content for a short read :)

https://www.seroundtable.com/duplexweb-google-bot-31522.html?utm_source=search_engine_roundtable&utm_campaign=ser_newsletter_2021-06-03&utm_medium=email

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Recommendation for free and production-ready Hadoop setup to run Nutch

2021-06-03 Thread lewis john mcgibbney
Hi Sebastian,
If we did not know how long our crawl infrastructure was required for (i.e.
the customer may  revoke or extend the contract with very little notice) we
always chose AWS EMR. Specifically to reduce costs we made sure that all
worker/task nodes were run on spot instances
https://aws.amazon.com/ec2/spot/use-case/emr/ to achieve significant cost
savings on larger deployments. This also means however that we needed to
put in place additional monitoring (Ganglia) and disaster recovery and data
backup logic (custom via hadoop fs and aws aws-cli) but this is good
practice anyway so the small investment was well worth it.
I had contemplated working more on the configuration management side of
things e.g. using Terraform or AWS CloudFormation to drive efficiencies in
repeatable deployments but I never got around to that.
ARM support was never a concern for us so I can't help there sorry.
lewismc

From: Sebastian Nagel 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Tue, 1 Jun 2021 16:35:22 +0200
> Subject: Recommendation for free and production-ready Hadoop setup to run
> Nutch
> Hi,
>
> does anybody have a recommendation for a free and production-ready Hadoop
> setup?
>
> - HDFS + YARN
> - run Nutch but also other MapReduce and Spark-on-Yarn jobs
> - with native library support: libhadoop.so and compression
>libs (bzip2, zstd, snappy)
> - must run on AWS EC2 instances and read/write to S3
> - including smaller ones (2 vCPUs, 16 GiB RAM)
> - ideally,
>- Hadoop 3.3.0
>- Java 11 and
>- support to run on ARM machines
>
> So far, Common Crawl uses Cloudera CDH but with no free updates
> anymore we consider either to switch to Amazon EMR, a Cloudera
> subscription or to use vanilla Hadoop (esp. since only HDFS and YARN
> are required).
>
> A dockerized setup is also an option (at least, for development and
> testing). So far, I've looked on [1] - the upgrade to Hadoop 3.3.0
> was straight-forward [2]. But native library support is still missing.
>
> Thanks,
> Sebastian
>
> [1] https://github.com/big-data-europe/docker-hadoop
> [2]
> https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
>
>
>
>
> -- Forwarded message --
> From: Markus Jelsma 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 1 Jun 2021 16:57:46 +0200
> Subject: Re: Recommendation for free and production-ready Hadoop setup to
> run Nutch
> Hello Sebastian,
>
> We have always used vanilla Apache Hadoop on our own physical servers that
> are running on the latest Debian, which also runs on ARM. It will run HDFS
> and YARN and any other custom job you can think of. It has snappy
> compression, which is a massive improvement for large data shuffling jobs,
> it runs on Java 11 and if necessary even on AWS, but I dislike it.
>
> You can easily read/write large files between HDFS en S3 without storing it
> on local filesystem so it ticks that box too.
>
> I don't know much about Docker, except that I don't like it either, but
> that is personal. I do like vanilla Apache Hadoop.
>
> Regards,
> Markus
>
>
>
> Op di 1 jun. 2021 om 16:35 schreef Sebastian Nagel
> :
>
> > Hi,
> >
> > does anybody have a recommendation for a free and production-ready Hadoop
> > setup?
> >
> > - HDFS + YARN
> > - run Nutch but also other MapReduce and Spark-on-Yarn jobs
> > - with native library support: libhadoop.so and compression
> >libs (bzip2, zstd, snappy)
> > - must run on AWS EC2 instances and read/write to S3
> > - including smaller ones (2 vCPUs, 16 GiB RAM)
> > - ideally,
> >- Hadoop 3.3.0
> >- Java 11 and
> >- support to run on ARM machines
> >
> > So far, Common Crawl uses Cloudera CDH but with no free updates
> > anymore we consider either to switch to Amazon EMR, a Cloudera
> > subscription or to use vanilla Hadoop (esp. since only HDFS and YARN
> > are required).
> >
> > A dockerized setup is also an option (at least, for development and
> > testing). So far, I've looked on [1] - the upgrade to Hadoop 3.3.0
> > was straight-forward [2]. But native library support is still missing.
> >
> > Thanks,
> > Sebastian
> >
> > [1] https://github.com/big-data-europe/docker-hadoop
> > [2]
> >
> https://github.com/sebastian-nagel/docker-hadoop/tree/2.0.0-hadoop3.3.0-java11
> >
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Crawling same domain URL's

2021-05-09 Thread Lewis John McGibbney
Hi Prateek,
mapred.map.tasks --> mapreduce.job.maps
mapred.reduce.tasks --> mapreduce.job.reduces
You should be able to override these in nutch-site.xml then publish to your 
Hadoop cluster.
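For example (the values are illustrative; size them to your cluster):

<property>
  <name>mapreduce.job.maps</name>
  <value>8</value>
</property>

<property>
  <name>mapreduce.job.reduces</name>
  <value>4</value>
</property>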
lewismc

On 2021/05/07 15:18:38, prateek  wrote: 
> Hi,
> 
> I am trying to crawl URLs belonging to the same domain (around 140k) and
> because of the fact that all the same domain URLs go to the same mapper,
> only one mapper is used for fetching. All others are just a waste of
> resources. These are the configurations I have tried till now but it's
> still very slow.
> 
> Attempt 1 -
> fetcher.threads.fetch : 10
> fetcher.server.delay : 1
> fetcher.threads.per.queue : 1,
> fetcher.server.min.delay : 0.0
> 
> Attempt 2 -
> fetcher.threads.fetch : 10
> fetcher.server.delay : 1
> fetcher.threads.per.queue : 3,
> fetcher.server.min.delay : 0.5
> 
> Is there a way to distribute the same domain URLs across all the
> fetcher.threads.fetch? I understand that in this case crawl delay cannot be
> reinforced across different mappers but for my use case it's ok to crawl
> aggressively. So any suggestions?
> 
> Regards
> Prateek
> 


Re: Writing Nutch data in Parquet format

2021-05-06 Thread Lewis John McGibbney
Hi Seb,
Really interesting. Thanks for the response. Below

On 2021/05/05 11:42:04, Sebastian Nagel  
wrote: 
> 
> Yes, but not directly - it's a multi-step process. 

As I expected ;)

> 
> This Parquet index is optimized by sorting the rows by a special form of the 
> URL [1] which
> - drops the protocol or scheme
> - reverses the host name and
> - puts it in front of the remaining URL parts (path and query)
> - with some additional normalization of path and query (eg. sorting of query 
> params)
> 
> One example:
>    https://example.com/path/search?q=foo&l=en
>    com,example)/path/search?l=en&q=foo
> 
> The SURT URL is similar to the URL format used by Nutch2
>    com.example/https/path/search?q=foo&l=en
> to address rows in the WebPage table [2]. This format is inspired by the 
> BigTable
> paper [3].  The point is that  cf. [4].

OK, I recognize this data model. Seems logical. 
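
For anyone curious, a rough Java sketch of the host-reversal step (this is
not the actual Nutch/Common Crawl implementation; query-parameter sorting
and the other normalizations mentioned above are omitted):

import java.net.URI;

public class SurtSketch {

  // toSurt("https://example.com/path/search?q=foo&l=en")
  //   -> "com,example)/path/search?q=foo&l=en"
  public static String toSurt(String url) {
    URI u = URI.create(url);
    String[] labels = u.getHost().split("\\.");
    StringBuilder sb = new StringBuilder();
    for (int i = labels.length - 1; i >= 0; i--) { // reverse the host labels
      sb.append(labels[i]);
      if (i > 0) sb.append(',');
    }
    sb.append(')');                                // the scheme is simply dropped
    if (u.getRawPath() != null) sb.append(u.getRawPath());
    if (u.getRawQuery() != null) sb.append('?').append(u.getRawQuery());
    return sb.toString();
  }
}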

> Ok, back to the question: both 1) and 2) are trivial if you do not care about
> writing an optimal Parquet files: just define a schema following the methods 
> implementing
> the Writable interface. Parquet is easier to feed into various data 
> processing systems
> because it integrates the schema. The Sequence file format requires that the
> Writable formats are provided - although Spark and other big data tools 
> support
> Sequence files this requirement is sometimes a blocker, also because Nutch
> does not ship a small "nutch-formats" jar.

In my case, the purpose of writing Nutch (Hadoop sequence file) data to Parquet 
format was to facilitate (improved) analytics within the Databricks platform 
which we are currently evaluating.
I'm hesitant to re-use the word 'optimal' because I have not yet benchmarked 
any retrievals but I 'hope' that I can begin to work on 'optimizing' the way 
that Nutch data is written such that it can be analyzed with relative ease 
within, for example Databricks.

> 
> Nevertheless, the price for Parquet is slower writing - which is ok for 
> write-once-read-many
> use cases. 

Yes, this is our use case.

> But the typical use case for Nutch is "write-once-read-twice":
> - segment: read for CrawlDb update and indexing
> - CrawlDb: read during update then replace, in some cycles read for 
> deduplication, statistics, etc.

So sequence files are optimal for use within the Nutch system but for 
additional analytics (on outside platforms such as Databricks) I suspect that 
Parquet would be preferred. 

Maybe we can share more ideas. I wonder if a utility tool to write segments as 
Parquet data would be useful?
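To make the idea concrete, here is an untested sketch using Spark's Java API
(the segment path, output path and three-column schema are illustrative
assumptions; the Nutch job jar must be on the Spark classpath so the Writable
classes resolve):

import org.apache.hadoop.io.Text;
import org.apache.nutch.protocol.Content;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

public class SegmentContentToParquet {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("SegmentContentToParquet").getOrCreate();
    JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

    // A segment's content directory holds MapFiles of <Text url, Content>;
    // the underlying 'data' files are readable as SequenceFiles.
    JavaRDD<Row> rows = jsc
        .sequenceFile("crawl/segments/20210504123456/content/part-*/data",
                      Text.class, Content.class)
        .map(kv -> RowFactory.create(kv._1.toString(),       // url
                                     kv._2.getContentType(), // MIME type
                                     kv._2.getContent()));   // raw bytes

    StructType schema = new StructType()
        .add("url", DataTypes.StringType)
        .add("contentType", DataTypes.StringType)
        .add("content", DataTypes.BinaryType);

    spark.createDataFrame(rows, schema)
        .write().parquet("/tmp/segment-content.parquet");
  }
}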

Thanks Seb


Writing Nutch data in Parquet format

2021-05-04 Thread Lewis John McGibbney
Hi user@,
Has anyone experimented/accomplished either
1) writing Nutch data directly as Parquet format, or
2) post-processing (Nutch) Hadoop sequence data by converting it to Parquet 
format?
Thank you
lewismc


[ANNOUNCE] Apache Nutch 1.18 Release

2021-01-24 Thread lewis john mcgibbney
*What?*
The Apache Nutch team is pleased to announce the release of Apache Nutch
v1.18.
Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™ data structures.

*Where?*
Source and binary distributions are available for download from the Apache
Nutch download site:
https://nutch.apache.org/downloads.html

Please verify signatures using the KEYS file available at the above
location when downloading the release.

*Further Information*
This release includes ~30 bug fixes and improvements, the full list of
changes can be seen in the release report

https://s.apache.org/lqara

Please also check the changelog for breaking changes:
https://apache.org/dist/nutch/1.18/CHANGES.txt

The Nutch DOAP can be seen at http://nutch.apache.org/doap.rdf

Thanks to everyone who contributed to this release!

lewismc on behalf of the Apache Nutch Project Management Committee

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.18 RC1

2021-01-24 Thread lewis john mcgibbney
user@, dev@,
The 72-hour voting period has elapsed. The results are as follows

[5] +1 Release this package as Apache Nutch 1.18.

Lewis John McGibbney*
Ralf Kotowski*
Jorge Luis Betancourt Gonzalez*
Sebastian Nagel*
Shashanka Balakuntala Srinivasa*

[0] -1 Do not release this package because…

*Nutch PMC binding VOTE

Thank you to everyone able to VOTE. I'll go ahead and complete the release
process.

lewismc

On Wed, Jan 20, 2021 at 5:22 PM lewis john mcgibbney 
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.18 release is available at [0] where
> accompanying SHA512, ASC and MD5 signatures can also be found.
> Information on verifying releases can be found at [1].
> The release candidate is a .zip and tar.gz archive of the sources in [2]
>
> In addition, a staged maven repository is available at [3]
>
> Please vote on releasing this package as Apache Nutch 1.18. The vote is
> open for at least the next 72 hours and passes if a majority of at least
> three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.18.
> [ ] -1 Do not release this package because…
>
> Cheers,
> lewismc
> P.S. Here is my +1.
>
> [0] https://dist.apache.org/repos/dist/dev/nutch/1.18/
> [1] http://nutch.apache.org/downloads.html#verify
> [2]
> https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=tag;h=a8ef2997d14eb7af95dcafee379d54b31f89dd1a
> [3] https://repository.apache.org/content/repositories/orgapachenutch-1019
> --
> http://home.apache.org/~lewismc/
> http://people.apache.org/keys/committer/lewismc
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[VOTE] Release Apache Nutch 1.18 RC1

2021-01-20 Thread lewis john mcgibbney
Hi Folks,

A first candidate for the Nutch 1.18 release is available at [0] where
accompanying SHA512, ASC and MD5 signatures can also be found.
Information on verifying releases can be found at [1].
The release candidate is a .zip and tar.gz archive of the sources in [2]

In addition, a staged maven repository is available at [3]

Please vote on releasing this package as Apache Nutch 1.18. The vote is
open for at least the next 72 hours and passes if a majority of at least
three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.18.
[ ] -1 Do not release this package because…

Cheers,
lewismc
P.S. Here is my +1.

[0] https://dist.apache.org/repos/dist/dev/nutch/1.18/
[1] http://nutch.apache.org/downloads.html#verify
[2]
https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=tag;h=a8ef2997d14eb7af95dcafee379d54b31f89dd1a
[3] https://repository.apache.org/content/repositories/orgapachenutch-1019
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Extract all image and video links from a web page

2021-01-20 Thread Lewis John McGibbney
Hi Prateek,

On 2021/01/19 15:58:29, prateek  wrote: 
> Is the only other option to
> override HtmlParseFilter and add a new plugin?

Yes I think it is.

> 
> Also regarding separate objects, what I meant is if I store the image links
> in Outlink, then those links will also be stored in the DB (because all
> outlinks are stored for the next crawl of depth > 1). I don't want to store
> those in the crawldb, just output them in some other object within the
> record. I hope this makes sense.

I understand. Seeing as you cannot upgrade then yes I think you need to 
implement a new plugin to capture the outlinks as a new field in the 
NutchDocument. You should also look into using the 
'parser.html.outlinks.ignore_tags' configuration setting. You can specify which 
tags are filtered.
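For example, in nutch-site.xml (the tag list is illustrative):

<property>
  <name>parser.html.outlinks.ignore_tags</name>
  <value>img,script,link</value>
</property>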

lewismc


Re: Extract all image and video links from a web page

2021-01-14 Thread lewis john mcgibbney
Hi prateek,
Please see my comment inline below

On Thu, Jan 14, 2021 at 6:39 AM  wrote:

>
> One of the requirements I have is to extract all
> the image and video links from the html in a separate object. Since I have
> the html content, I can use a library like jsoup to parse the content and
> extract img tags.
> I was wondering if there is a way in nutch to do this?
>

The problem here is your requirement of "... in a separate object". Will
this separate object be a new record?


> I am assuming I will have to override HtmlParseFilter class and then add my
> extraction logic there. Is my understanding correct? Any sample code
> reference will be helpful as well.
>
>
I think you can simply add parse-html OR parse-tika AND parse-xsl to the
'plugin.includes' configuration property and then use the ordered
HTMLParseFilter configuration option 'htmlparsefilter.order' as follows
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1599
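
For illustration only, a hypothetical nutch-site.xml fragment (the parse-xsl
plugin id and the filter class name below are assumptions based on the pull
request, not a released API; 'htmlparsefilter.order' takes fully qualified
HtmlParseFilter class names in the order they should run):

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|xsl)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

<property>
  <name>htmlparsefilter.order</name>
  <value>org.apache.nutch.parse.xsl.XslParseFilter</value>
</property>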

You can take a look at the parse-xsl plugin
https://github.com/apache/nutch/pull/439/files#diff-bb284524d36ab1d537581c95eb200b98a9e28bb8a8b48329914d2e09f6413d36

N.B. This patch is not yet merged into the Nutch master branch so it is not
available in an official Nutch release. You would need to upgrade to the Nutch
1.18-SNAPSHOT master branch and then apply the patch. Any feedback would
be greatly appreciated.

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: NutchTutorial error

2020-09-24 Thread lewis john mcgibbney
Hi Moldable,
Thanks for taking the time to write. Please see my responses inline below

On Wed, Sep 23, 2020 at 10:38 PM Moldable  wrote:

> Hi,
>
> Sorry for the email sent randomly to you. I am trying to report a tiny
> issue in the nutch documentation and I can't figure out how to do it
> without a lot of work. I'm not a developer etc and I don't want to learn
> git and clone the whole wiki or whatever just for this. I went into the
> github but there was no issue tracker so I just looked for someone recently
> active and you were the lucky person.
>
> So please do pass this on to whoever can find it useful.
>
> on NutchTutorial#Setup_Solr_for_search, it says:
>
> Note: due to NUTCH-2745
> the schema.xml is not contained in the binary package. Please download the
> schema.xml from the source repository.
>
> the link to the xml file is dead. I *think* perhaps the correct link
> might be
> https://raw.githubusercontent.com/apache/nutch/release-1.16/src/plugin/indexer-solr/schema.xml
>  ?
>
>

Thank you for pointing this out. I've corrected the hyperlinks. If there
are any further suggestions then please let me know.


>
> If it's not it would be fantastic to know cause I'm following a tutorial
> elsewhere and I am insufficiently sophisticated to figure these things out
> for myself.
>

You should be good to go with the resources pointed to by the hyperlinks.


>
> and hey, maybe youse'd like to consider an email address or other way to
> contact for people like me... but maybe it's too much work for little gain.
>

We have a number of community mailing lists you can use depending on what
level of involvement you have with the software.
http://nutch.apache.org/mailing_lists.html
I've Cc'd user@nutch.a.o such that everyone else knows that the
documentation has been updated.

Thank you very much for getting in touch.
lewismc


>
> that's all!
>
> thanks for your attention in this matter
>
>
> --
> Securely sent with Tutanota. Get your own encrypted, ad-free mailbox:
> https://tutanota.com
>


Re: Facing Gora exception in Nutch 2.4

2020-09-20 Thread lewis john mcgibbney
Hi Gajalakshmi.G,
Firstly, it's important for me to state that Nutch 2.X is deprecated. No
more development is happening on the 2.X branch.
That being said, please see my comments inline below

On Thu, Sep 17, 2020 at 7:45 AM  wrote:

>
> I am using Nutch 2.4 with Hadoop 3.1.1


To the best of my knowledge, Nutch 2.4 was never tested against Hadoop 3.x
https://github.com/apache/nutch/blob/release-2.4/ivy/ivy.xml#L49-L61


> and hbase 2.0.2 along with Gora 0.9 version.
>

It was also not tested against Gora 0.9
https://github.com/apache/nutch/blob/release-2.4/ivy/ivy.xml#L107
However Gora 0.9 WAS tested against HBase 2.1.1
https://github.com/apache/gora/blob/apache-gora-0.9/pom.xml#L787
HOWEVER Gora 0.9 WAS NOT tested against Hadoop 3.X


>
> I am getting below error  while trying to run the code:
>

... 


> Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.hbase.io
> .LimitInputStream
> at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
> ... 22 more
>

The ClassNotFoundException is most likely caused by an incompatible
dependency trail as I describe above. If you really wanted to use Nutch 2.X
(which I would advise against at this stage) then you would need to update
the dependency chain.

Sorry I can't be of any further help.
Lewis


Re: Nutch 1.17 download available?

2020-06-07 Thread Lewis John McGibbney
Hi Jim,
Response below

On 2020/06/06 14:23:24, Jim Anderson  wrote: 
> 
> I cannot find a download for Nutch 1.17. Is Nutch 1.17 available for
> download? If so, can someone please give me a pointer.
> 

Nutch 1.17 is the current master branch, i.e. in development, meaning that there is 
no official release as of yet. 

The most recent version of Nutch is 1.16 which you can download from the 
downloads page http://nutch.apache.org/downloads.html

Heads up here, all official releases are automatically archived at 
archive.apache.org. For example, the Nutch releases are available at 
http://archive.apache.org/dist/nutch/

Thanks. Any more questions please let us know :)

lewismc 



Re: [DISCUSS] Release 1.17 ?

2020-04-23 Thread lewis john mcgibbney
Hi Seb,
Go for it. I’ll happily review.
Excellent work folks... really excellent work.
lewismc

On Wed, Apr 22, 2020 at 23:27  wrote:

>
> user Digest 23 Apr 2020 06:27:46 - Issue 3055
>
> Topics (messages 34517 through 34517)
>
> [DISCUSS] Release 1.17 ?
> 34517 by: Sebastian Nagel
>
>
>
>
>
> -- Forwarded message --
> From: Sebastian Nagel 
> To: d...@nutch.apache.org, user@nutch.apache.org
> Cc:
> Bcc:
> Date: Thu, 23 Apr 2020 08:27:39 +0200
> Subject: [DISCUSS] Release 1.17 ?
> Hi all,
>
> 30 issues are done now
>   https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090
>
> including a number of important dependency upgrades:
> - Hadoop 3.1 (NUTCH-2777)
> - Elasticsearch 7.3.0 REST client (NUTCH-2739)
> Thanks to Shashanka Balakuntala Srinivasa for both!
>
> Dependency upgrades to be included (but still open right now):
> - Tika 1.24.1
> - Solr 8.5.1
>
> The last release (1.16) was in October, so it's definitely not too early to
> release 1.17.  As usual, we'll check all remaining issues whether they
> should
> be fixed now or can be done later in 1.18.
>
> I would be ready to push a release candidate during the next weeks and have
> already started to work through the remaining issues. Please, comment on
> issues you want to get fixed already in 1.17!
>
> Thanks,
> Sebastian
>
> --
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


[SECURITY] Nutch 2.3.1 affected by downstream dependency CVE-2016-6809

2019-10-14 Thread lewis john mcgibbney
Title: Nutch 2.3.1 affected by downstream dependency CVE-2016-6809

Vulnerable Versions: 2.3.1 (1.16 is not vulnerable)

Disclosure date: 2018-10-22

Credit: Pierre Ernst, Salesforce

Summary: Remote Code Execution in Apache Nutch 2.3.1 when crawling web site
containing malicious content

Description: The reporter found an RCE security vulnerability in Nutch
2.3.1 when crawling a web site that links a doctored Matlab file. This was
due to unsafe deserialization of user generated content. The root cause is
2 outdated 3rd party dependencies: 1. Apache Tika version 1.10
(CVE-2016-6809) 2. Apache Commons Collections 4 version 4.0
(COLLECTIONS-580) Upgrading these 2 dependencies to the latest version will
fix the issue.

Resolution: The Apache Nutch Project Management Committee released Apache
Nutch 2.4 on 2019-10-11 (https://s.apache.org/uw8i3). All users of the 2.X
branch should upgrade to this version immediately. In addition, note that
we expect that v2.4 is the last release on the 2.x series. The Nutch PMC
decided to freeze the development on the 2.x branch for now, as no
committers are actively working on it. See the above hyperlink for more
information on upgrading and the 2.x retirement decision.

Contact: either dev[at] or private[at]nutch[dot]apache[dot]org depending on
the nature of your contact.

Regards lewismc
(On behalf of the Apache Nutch PMC)
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: [VOTE] Release Apache Nutch 2.4 RC#1

2019-10-01 Thread lewis john mcgibbney
Hi Seb,
After purging ~/.ivy2/ and ~/.ant/ all tests pass.
Test crawl cycle with HBase appears fine.
Some more specifics

lmcgibbn@MT-207576 ~/Downloads $ gpg --verify
apache-nutch-2.4-src.tar.gz.asc apache-nutch-2.4-src.tar.gz
gpg: Signature made Mon Sep 23 14:29:05 2019 PDT
gpg:                using RSA key FF82A487F92D70E52FF77E0AC66EA7B7DB0A9C6D
gpg: Good signature from "Sebastian Nagel " [unknown]
gpg: WARNING: This key is not certified with a trusted signature!
gpg:          There is no indication that the signature belongs to the owner.
Primary key fingerprint: FF82 A487 F92D 70E5 2FF7  7E0A C66E A7B7 DB0A 9C6D

lmcgibbn@MT-207576 ~/Downloads $ sha512sum --check
apache-nutch-2.4-src.tar.gz.sha512
apache-nutch-2.4-src.tar.gz: OK

All other files (CHANGES, NOTICE) test out.

+1
Thank you for preparing the RC.
Lewis

On Tue, Oct 1, 2019 at 2:08 AM  wrote:

> From: Sebastian Nagel 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 1 Oct 2019 11:01:02 +0200
> Subject: Re: [VOTE] Release Apache Nutch 2.4 RC#1
> Hi Lewis,
>
> this looks pretty much the same as
>   https://issues.apache.org/jira/browse/IVY-1333
> which affected ivy 2.2.0 but should not affect the Nutch
> build as 2.4 is used. The used ivy version is logged
> soon after ant is run:
>   [ivy:resolve] :: Apache Ivy 2.4.0 - 20141213170938 ::
> http://ant.apache.org/ivy/ ::
>
> Are there any ivy libs in
>   ~/.ant/lib/
> because these will take precedence?
>
> If in doubt, move away ~/.ivy2/ and ~/.ant/ and rerun the build.
> Also make sure that Java 8 is used.
>
> One point: the 2.4 release package ships
>ivy/ivy-2.2.0.jar
> but it shouldn't be an issue as the ivy jar is loaded including
> the version number. I've opened
>https://issues.apache.org/jira/browse/NUTCH-2741
> to remove it.
>
> Best,
> Sebastian
>
> On 28.09.19 17:54, lewis john mcgibbney wrote:
> > Hi Seb,
> >
> > On Thu, Sep 26, 2019 at 4:37 AM 
> wrote:
> >
> >> From: Sebastian Nagel 
> >> To: user@nutch.apache.org
> >> Cc: d...@nutch.apache.org
> >> Bcc:
> >> Date: Tue, 24 Sep 2019 11:54:48 +0200
> >> Subject: [VOTE] Release Apache Nutch 2.4 RC#1
> >> Hi Folks,
> >>
> >> A first candidate for the Nutch 2.4 release is available at:
> >>   https://dist.apache.org/repos/dist/dev/nutch/2.4/
> >
> >
> > All signatures are good for the tar.gz and zip artifacts.
> > I get an error message and failed build when running 'ant runtime test'
> >
> > BUILD FAILED
> > /Users/lmcgibbn/Downloads/apache-nutch-2.4/build.xml:143: The following
> > error occurred while executing this line:
> > /Users/lmcgibbn/Downloads/apache-nutch-2.4/src/plugin/build.xml:54: The
> > following error occurred while executing this line:
> >
> /Users/lmcgibbn/Downloads/apache-nutch-2.4/src/plugin/build-plugin.xml:213:
> > impossible to resolve dependencies:
> > java.lang.IllegalStateException: impossible to get artifacts when data
> has
> > not been loaded. IvyNode = javax.measure#unit-api;1.0
> > at org.apache.ivy.core.resolve.IvyNode.getArtifacts(IvyNode.java:763)
> > at
> >
> org.apache.ivy.core.resolve.IvyNode.getSelectedArtifacts(IvyNode.java:740)
> > at
> >
> org.apache.ivy.core.report.ResolveReport.setDependencies(ResolveReport.java:235)
> > at
> org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:236)
> > at
> org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:193)
> > at org.apache.ivy.Ivy.resolve(Ivy.java:502)
> > at org.apache.ivy.ant.IvyResolve.doExecute(IvyResolve.java:244)
> > at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
> > at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
> > at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
> > at
> >
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> > at java.lang.reflect.Method.invoke(Method.java:498)
> > at
> >
> org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
> > at org.apache.tools.ant.Task.perform(Task.java:350)
> > at org.apache.tools.ant.Target.execute(Target.java:448)
> > at org.apache.tools.ant.Target.performTasks(Target.java:469)
> > at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
> > at
> >
> org.apache.tools.ant.helper.SingleCheckExecutor.executeTargets(SingleCheckExecutor.java:36)
> > at org.apache.tools.ant.Project.executeTargets(Project.java:1260)
> > at org.apache.tools.ant.taskdefs.Ant.execute(Ant.java:446)
> > at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java

Re: [VOTE] Release Apache Nutch 2.4 RC#1

2019-09-28 Thread lewis john mcgibbney
Hi Seb,

On Thu, Sep 26, 2019 at 4:37 AM  wrote:

> From: Sebastian Nagel 
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Bcc:
> Date: Tue, 24 Sep 2019 11:54:48 +0200
> Subject: [VOTE] Release Apache Nutch 2.4 RC#1
> Hi Folks,
>
> A first candidate for the Nutch 2.4 release is available at:
>   https://dist.apache.org/repos/dist/dev/nutch/2.4/


All signatures are good for the tar.gz and zip artifacts.
I get an error message and failed build when running 'ant runtime test'

BUILD FAILED
/Users/lmcgibbn/Downloads/apache-nutch-2.4/build.xml:143: The following
error occurred while executing this line:
/Users/lmcgibbn/Downloads/apache-nutch-2.4/src/plugin/build.xml:54: The
following error occurred while executing this line:
/Users/lmcgibbn/Downloads/apache-nutch-2.4/src/plugin/build-plugin.xml:213:
impossible to resolve dependencies:
java.lang.IllegalStateException: impossible to get artifacts when data has
not been loaded. IvyNode = javax.measure#unit-api;1.0
at org.apache.ivy.core.resolve.IvyNode.getArtifacts(IvyNode.java:763)
at
org.apache.ivy.core.resolve.IvyNode.getSelectedArtifacts(IvyNode.java:740)
at
org.apache.ivy.core.report.ResolveReport.setDependencies(ResolveReport.java:235)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:236)
at org.apache.ivy.core.resolve.ResolveEngine.resolve(ResolveEngine.java:193)
at org.apache.ivy.Ivy.resolve(Ivy.java:502)
at org.apache.ivy.ant.IvyResolve.doExecute(IvyResolve.java:244)
at org.apache.ivy.ant.IvyTask.execute(IvyTask.java:277)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:350)
at org.apache.tools.ant.Target.execute(Target.java:448)
at org.apache.tools.ant.Target.performTasks(Target.java:469)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at
org.apache.tools.ant.helper.SingleCheckExecutor.executeTargets(SingleCheckExecutor.java:36)
at org.apache.tools.ant.Project.executeTargets(Project.java:1260)
at org.apache.tools.ant.taskdefs.Ant.execute(Ant.java:446)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:350)
at org.apache.tools.ant.Target.execute(Target.java:448)
at org.apache.tools.ant.Target.performTasks(Target.java:469)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at
org.apache.tools.ant.helper.SingleCheckExecutor.executeTargets(SingleCheckExecutor.java:36)
at org.apache.tools.ant.Project.executeTargets(Project.java:1260)
at org.apache.tools.ant.taskdefs.Ant.execute(Ant.java:446)
at org.apache.tools.ant.UnknownElement.execute(UnknownElement.java:292)
at sun.reflect.GeneratedMethodAccessor4.invoke(Unknown Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at
org.apache.tools.ant.dispatch.DispatchUtils.execute(DispatchUtils.java:106)
at org.apache.tools.ant.Task.perform(Task.java:350)
at org.apache.tools.ant.Target.execute(Target.java:448)
at org.apache.tools.ant.Target.performTasks(Target.java:469)
at org.apache.tools.ant.Project.executeSortedTargets(Project.java:1399)
at org.apache.tools.ant.Project.executeTarget(Project.java:1370)
at
org.apache.tools.ant.helper.DefaultExecutor.executeTargets(DefaultExecutor.java:41)
at org.apache.tools.ant.Project.executeTargets(Project.java:1260)
at org.apache.tools.ant.Main.runBuild(Main.java:849)
at org.apache.tools.ant.Main.startAnt(Main.java:228)
at org.apache.tools.ant.launch.Launcher.run(Launcher.java:283)
at org.apache.tools.ant.launch.Launcher.main(Launcher.java:101)


Re: Injection from webservice

2019-09-19 Thread lewis john mcgibbney
Hi Folks,
I've implemented what Dave suggested... it is clean and easy but maybe
not quite as ad-hoc-capable as one would always want. For my use cases it
was acceptable.
More responses inline

On Thu, Sep 19, 2019 at 2:47 PM  wrote:

> From: Jorge Betancourt 
> To: user@nutch.apache.org
> Cc:
> Bcc:
>
>
[snip]


>
> My main concern is if we want to put this additional complexity in Nutch.
> It is really valuable to all of our users to have HTTP/DB/custom injectors
> available out of the box in a pluggable way?
>
> I would love to hear what other people have to say.
>
In all honesty, I would like to see as much of the REST logic and WebUI
extracted out of the core codebase as possible. I feel like we should have
done it this way around initially but didn't.
Considering 'separation of concerns' for Nutch is important, and Jorge, you're
spot on with your reservations.

Lewis


Mavenize Nutch Build as Google Summer of Code

2019-03-09 Thread lewis john mcgibbney
Hi user@ and dev@,
If you are a student and would like to tackle the task of Mavenizing the
Nutch master build please get in touch with me here, directly or comment on
the following issue
https://issues.apache.org/jira/plugins/servlet/mobile#issue/NUTCH-2292
Thank you
Lewis
-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: mapred.child.java.opts

2018-12-08 Thread Lewis John McGibbney
Hi Hany,
Yes, the parameter is set to 1GB by default, but it should also be noted that 
this configuration key was deprecated some time ago. Seeing as we 
are using the 'new' MapReduce API, I suspect we should use 
`mapreduce.map.java.opts` and `mapreduce.reduce.java.opts` instead, so that is 
something we need to update.
Can you please provide a patch for this and submit it against the 1.x branch?

Now to answer your question: essentially these configuration parameters enable 
you to tune the heap size for the child JVMs of maps and reduces respectively. In 
the context of Nutch this might be useful if certain crawl phases, e.g. parsing, 
consume more heap memory. This will ultimately be crawl-specific.
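
As a sketch, the substitution in the crawl script's common options could look
like this (the -Xmx values are illustrative, not recommendations):

  # deprecated key currently set in the crawl script:
  #   -D mapred.child.java.opts=-Xmx1000m
  # 'new' MapReduce API equivalents:
  -D mapreduce.map.java.opts=-Xmx1000m \
  -D mapreduce.reduce.java.opts=-Xmx1000m \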
HTH
Lewis

On 2018/12/07 14:08:59, hany.n...@hsbc.com wrote: 
> Hello,
> 
> While checking the Nutch (1.15) crawl bash file, I noticed at line 211 that 
> 1000MB is statically set for java - > mapred.child.java.opts=-Xmx1000m
> 
> Any idea why? Can I change it? What will be the impact?
> Kind regards,
> Hany Shehata
> Enterprise Engineer
> Green Six Sigma Certified
> Solutions Architect, Marketing and Communications IT
> Corporate Functions | HSBC Operations, Services and Technology (HOST)
> ul. Kapelanka 42A, 30-347 Kraków, Poland
> __
> 
> Tie line: 7148 7689 4698
> External: +48 123 42 0698
> Mobile: +48 723 680 278
> E-mail: hany.n...@hsbc.com
> __
> Protect our environment - please only print this if you have to!
> 
> 
> 
> 


Re: [ask] Crawl Forum Site

2018-12-03 Thread lewis john mcgibbney
Hi Tukang,
In short yes. It would help if you could provide an example of what you've
tried and what you encountered/what your results were.
Lewis

On Mon, Dec 3, 2018 at 6:42 PM  wrote:

>
> From: tkg_cangkul 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 04 Dec 2018 09:40:47 +0700
> Subject: [ask] Crawl Forum Site
> Hi,
>
> Is there possible to crawling Web Forum with Apache Nutch?
> If possible, is there any configuration that i must add?
> I've try it but i've nothing.
>
> Pls help . Need advice.
>
> Thanks
>
> Best Regards,
> Tukang Cangkul
>
>

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Apache Nutch vs Multiple elasticsearch nodes

2018-11-28 Thread lewis john mcgibbney
Hi Marcello,
I don't think this behavior is correct, no!
First however, I really suggest that we upgrade the Jest client in this
plugin. The most recent version is 6.3.1 and we are using 2.0.3.
Please see https://issues.apache.org/jira/browse/NUTCH-2677; if you are
able to provide a patch and test it out, that would be great. Please see my
response inline below

On Wed, Nov 28, 2018 at 6:42 AM  wrote:

> From: Marcello Lorenzi 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 28 Nov 2018 15:41:45 +0100
> Subject: Apache Nutch vs Multiple elasticsearch nodes
> Hi All,
> we installed the latest version of Apache Nutch to crawl some HTML pages
> but we tested all the operations with a single Elasticsearch instance. We
> use the Elasticsearch REST index writer but into the "host" parameter we
> configure the string "es-elk-pr01.test.local, es-elk-sv01.test.local" the
> JEST client has been started with only 1 server.
>
>  INFO AbstractJestClient:56 - Setting server pool to a list of 1 servers: [
> http://es-elk-pr01.test.local , es-elk-sv01.test.local:9200]
>
> Is this behavior correct?
>

Please see the following message,
https://github.com/apache/nutch/blob/master/src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java#L110
I think you should configure your host in index-writers.xml, specifically
see

https://github.com/apache/nutch/blob/master/conf/index-writers.xml.template#L127-L150
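
As a rough sketch of the writer entry (parameter names follow the template
linked above; the port and index values below are assumptions):

  <writer id="indexer_elastic_rest_1"
          class="org.apache.nutch.indexwriter.elasticrest.ElasticRestIndexWriter">
    <parameters>
      <param name="host" value="es-elk-pr01.test.local,es-elk-sv01.test.local"/>
      <param name="port" value="9200"/>
      <param name="index" value="nutch"/>
      <!-- mapping section omitted -->
    </parameters>
  </writer>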

HTH
Lewis


>
> Thanks in advance,
> Marcello
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: webapp for Nutch deploy mode

2018-10-18 Thread Lewis John McGibbney
Hi Gajanan,
Response inline

On 2018/10/12 07:40:50, Gajanan Watkar  wrote: 
> Hi all,
> I am using Nutch 2.3.1 with Hbase-1.2.3 as storage backend on top of
> Hadoop-2.5.2 cluster in *deploy mode* with crawled data being indexed to
> solr-6.5.1.
> I want to use *webapp* for creating, controlling and monitoring crawl jobs
> in deploy mode.
> 
> With the Hadoop cluster, HBase and nutchserver started, when I tried to launch
> a Crawl Job through the webapp interface, the InjectorJob failed.
> It was happening due to the seed directory being created on the local
> filesystem. I fixed it by moving it to the same path on HDFS by editing the
> *createSeedFile* method in *org.apache.nutch.api.resources.SeedResource.java*.
> 
> public String createSeedFile(SeedList seedList) {
> if (seedList == null) {
>   throw new WebApplicationException(Response.status(Status.BAD_REQUEST)
>   .entity("Seed list cannot be empty!").build());
> }
> File seedFile = createSeedFile();
> BufferedWriter writer = getWriter(seedFile);
> 
> Collection<SeedUrl> seedUrls = seedList.getSeedUrls();
> if (CollectionUtils.isNotEmpty(seedUrls)) {
>   for (SeedUrl seedUrl : seedUrls) {
> writeUrl(writer, seedUrl);
>   }
> }
> 
> 
> // method to copy the seed directory to HDFS: Gajanan
> copyDataToHDFS(seedFile);
> 
> return seedFile.getParent();
>   }

I was aware of this some time ago and never found the time to fix it. I just 
checked JIRA as well and there is no ticket for addressing the task however I 
am certain that it has been discussed on this mailing list previously.
Anyway, can you please create an issue in JIRA labeling it as affecting 2.x and 
tag it with both "REST_api" and "web gui" and submit this as a pull request. It 
would be a huge help.
> 
> Then I was able to go up to the index phase where it complained of not having
> set *solr.server.url* java property.
> *I set JAVA_TOOL_OPTIONS to include -Dsolr.server.url property.*
> 
> *Crawl Job is still failing with:*
> 18/10/11 10:07:03 ERROR impl.RemoteCommandExecutor: Remote command failed
> java.util.concurrent.TimeoutException
> at java.util.concurrent.FutureTask.get(FutureTask.java:205)
> at
> org.apache.nutch.webui.client.impl.RemoteCommandExecutor.executeRemoteJob(RemoteCommandExecutor.java:61)
> at
> org.apache.nutch.webui.client.impl.CrawlingCycle.executeCrawlCycle(CrawlingCycle.java:58)
> at
> org.apache.nutch.webui.service.impl.CrawlServiceImpl.startCrawl(CrawlServiceImpl.java:69)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:498)
> at
> org.springframework.aop.support.AopUtils.invokeJoinpointUsingReflection(AopUtils.java:317)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.invokeJoinpoint(ReflectiveMethodInvocation.java:190)
> at
> org.springframework.aop.framework.ReflectiveMethodInvocation.proceed(ReflectiveMethodInvocation.java:157)
> at
> org.springframework.aop.interceptor.AsyncExecutionInterceptor$1.call(AsyncExecutionInterceptor.java:97)
> at java.util.concurrent.FutureTask.run(FutureTask.java:266)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:748)
> 
> I tried to change the default timeout in
> org.apache.nutch.webui.client.impl.RemoteCommandExecutor
>
> private static final int DEFAULT_TIMEOUT_SEC = 300; // Can be increased
> if required

There are also various issues about this in JIRA. Can you please check them out 
and let me know if you can find the correct one. Maybe the following? 
https://issues.apache.org/jira/browse/NUTCH-2313
> 
> *Summary:*
> *But in all this, what i am wondering about is:*
> *1. No webpage table is being created in hbase corresponding to crawl ID.*

Again, please check JIRA for this information, there may already be something 
logged which will indicate what is wrong.

> *2. How in that case does it go up to the Index phase of the crawl?*

It shouldn't!

> 
> *Finally actual question:*
> 
> *How do I get my crawl jobs running in deploy mode using nutch webapp.
> What else I need to do. Am I missing something very basic.*

As far as I can remember this functionality has not been baked in... or else it 
may have been baked in but it is within 2.x from Git. Please check out the code 
from Git and try it there... your results may differ.

Lewis


Re: Unable to get regex-urlfilter working

2018-10-11 Thread lewis john mcgibbney
Hi Gajanan,
Seeing as you are using 2.x, are you making sure that the project has been
built with the correct regex-urlfilter.txt present on the classpath and
included in the job jar you are using?

On Thu, Oct 11, 2018 at 12:19 AM  wrote:

>
>
> From: Gajanan Watkar 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 10 Oct 2018 17:19:24 +0530
> Subject: Re: Unable to get regex-urlfilter working
> I am using Nutch 2.x with habse as backend storage.
>
> *-Gajanan*
>


Uneven HBase region sizes WAS Re: Nodemanager crashing repeatedly

2018-09-19 Thread lewis john mcgibbney
Hi Gajanan,
CC dev@gora, this is something we may wish to implement within HBase.
If anything I've provided below is incorrect, then please correct the
record.
BTW, I found the following article, written by Enis, to be extremely useful
https://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/

On Wed, Sep 19, 2018 at 3:55 AM  wrote:

> From: Gajanan Watkar 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 19 Sep 2018 16:24:52 +0530
> Subject: Re: Nodemanager crashing repeatedly
> Hi Lewis,
> It appears that my setup was infected. After studying ResourceManager logs
> closely I found that lot of jobs were getting submitted to my cluster as
> user "dr who". Moreover my crontab was listing 2 wget cron jobs I never
> configured (Suspect it to be cryptocurrency miner) and one java app running
> from /var/tmp/java. I Configured firewall, blocked port 8088, purged cron
> (as it was coming back with every re-install) and removed java app from
> /var/tmp/java. It seem to have stabilized my setup. For now it is working
> fine. No more unexpected NodeManager Exits. Also applied patch for
> MalformedURLException.
>

Good to hear that you were able to debug this. From the description you
provided I wondered if it had anything to do with Nutch 2.x, namely because
I've never experienced anything like this in the past.


>
> I am getting uneven region sizes, can you suggest me on pre-spliting
> webpage table i.e. split points to be used and splitting policy and optimum
> GC setup for regionserver for efficient Nutch crawling.
>
>
Can you provide the version of HBase you are using? Assuming that you are
using Nutch 2.x branch from Git, you should be using 1.2.6.
Can you also provide the logging from HBase which indicates uneven region
sizes?

From what I understand (and I am no HBase expert), when Gora first creates
the HBase table, by default, only one region is allocated for the table.
This means that initially, all requests will go to a single region server,
regardless of the number of region servers in your HBase deployment. A
knock-on effect of this is that the initial phases of loading data into the
empty webpage table cannot utilize the whole capacity of the HBase cluster;
however, I don't think this is by any means your issue.

The issue at hand is concerned with supplying split points at table creation
time, which would hopefully resolve the uneven region sizes. The comment I made
above regarding Gora allocating only one region for the table is correct; take
a look at [0] and you will see that we do not pass additional parameters to the
call to Admin.createTable which would explicitly specify, for example, the
split points. Examples of additional parameters which could be used when
creating our table are below; these can also be seen at [1].

void createTable(HTableDescriptor desc)
Creates a new table.

void createTable(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table with an initial set of empty regions defined by the
specified split keys.

void createTable(HTableDescriptor desc, byte[] startKey, byte[] endKey, int
numRegions)
Creates a new table with the specified number of regions.

void createTableAsync(HTableDescriptor desc, byte[][] splitKeys)
Creates a new table but does not block and wait for it to come online.

One other issue you asked was regarding the split policy... again, we do
not currently specify an explicit split policy, instead we utilize the auto
splitting capability (which I believe is ConstantSizeRegionSplitPolicy)
made available by HBase, however, if we wanted to implement an explicit
split policy, we could do so by implementing the code below at the
following line [2] within Gora's HBaseStore#createSchema method.

HTableDescriptor tableDesc = new HTableDescriptor("example-table");
tableDesc.setValue(HTableDescriptor.SPLIT_POLICY,
AwesomeSplitPolicy.class.getName()); //add columns etc
admin.createTable(tableDesc);

OR, we could make this configurable by providing the
'hbase.regionserver.region.split.policy' available within gora.properties.
There are a few ways we could prototype this.
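
As a minimal sketch of pre-splitting against the HBase 1.2 client API quoted
above (the split points, table name and column family below are illustrative
assumptions; since Nutch 2.x row keys are reversed URLs, prefixes of the
reversed domain are a plausible choice of split point):

import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.util.Bytes;

// assumes an initialized org.apache.hadoop.hbase.client.Admin instance
byte[][] splitKeys = new byte[][] {
    Bytes.toBytes("com."), Bytes.toBytes("net."), Bytes.toBytes("org.")
};
HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("webpage"));
desc.addFamily(new HColumnDescriptor("f")); // column family name is an assumption
admin.createTable(desc, splitKeys); // table starts with one region per split range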

Finally, regarding GC, I am not entirely sure right now. I don't know too
much about HBase optimization but just like any distributed system you
could tinker with GC values until you land at something which works. The
above however hopefully gets you started in the right direction.

hth
Lewis

[0]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L182
[1]
https://hbase.apache.org/1.2/apidocs/org/apache/hadoop/hbase/client/Admin.html
[2]
https://github.com/apache/gora/blob/master/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java#L180


Re: Nodemanager crashing repeatedly

2018-09-06 Thread lewis john mcgibbney
Hi Gajanan,
Which OS are you running this on?
I would also suggest that if you want to use the 2.x codebase, you should
use the most recent from SCM e.g. check out master and change to 2.x branch.
Finally, for now at least, you didn't mention the phase at which the crawl
is failing. Can you provide this?

On Thu, Sep 6, 2018 at 8:58 AM  wrote:

> From: Gajanan Watkar 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 05 Sep 2018 11:27:21 +0530
> Subject: Nodemanager crashing repeatedly
> I am running Nutch-2.3.1 over Hadoop-2.5.2 and Hbase-1.2.3 with
> integration to Solr-6.5.1. I have crawled over 10 million pages. But
> while doing all this I am continuously facing two problems:
>
> 1. My Nodemanager is crashing repeatedly during different phases of
> crawl. It crashes my linux session and forces logout with nodemanager
> killed. I log-in again, restart NodeManger and the same failed crawl
> phase runs to success. [Nodemanager log has nothing to report]
>
> 2. I am running all my crawl phases one by one without crawl script, as
> with crawl script most of the time my jobs were exiting with
> "WaitForjobCompletion" error at different stages of crawl. So, I
> decided to go ahead with one by one method which prevented
> "WaitForjobCompletion" to occure.
>
> Any help will be highly appreciated. New to mailing-list, New to Nutch.
>
> -Gajanan
>
>


Re: redirect bin/crawl log output to some other file

2018-09-06 Thread lewis john mcgibbney
Hi Amarnatha,
There are a couple of options which I can think of.
1. Why don't you just set up a simple daemon to watch hadoop.log and
generate a subsequent stream writing it to /tmp/myurls.log e.g. tail -f
hadoop.log > /tmp/myurls.log
2. Check out conf/log4j.properties; you will see the configuration for
hadoop.log in there. 'Maybe' you can change this location (see the sketch
below), rebuild your deployment and it will solve your issue.
I'm sure there are several other ways as well.
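
As a sketch of option 2, the relevant appender lines in conf/log4j.properties
could be pointed at a different file (the path below is just an example):

  log4j.appender.DRFA=org.apache.log4j.DailyRollingFileAppender
  log4j.appender.DRFA.File=/tmp/myurls.log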
hth
Lewis

On Thu, Sep 6, 2018 at 8:58 AM  wrote:

> From: Amarnatha Reddy 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 4 Sep 2018 22:13:58 +0530
> Subject: redirect bin/crawl log output to some other file
> Hi All,
>
> We are using the bin/crawl command to crawl and index data into Solr.
> Currently the output is written into the default logs/hadoop.log file, so my
> requirement is: how can I write the log data into a different file?
>
>
> bin/crawl -i -D solr.server.url=http://localhost:8983/solr/jeepkr -s urls/
> crawl/ 1  -->this will write log details under default path logs/hadoop.log
>
> How can I set the log path by passing it as part of bin/crawl?
>
> ex: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/jeepkr -s
> urls/ crawl/ 1  >/tmp/myurls.log
> --
>
>


Re: IndexWriter interface in 1.15

2018-09-06 Thread lewis john mcgibbney
Hi Yossi,

REASON: Upgrade of MapReduce API from legacy to 'new'. This was a breaking
change for sure and a HUGE patch. We did not however factor in the
non-breaking aspects of the upgrade... so it has not all been plain sailing.
PROPOSED SOLUTION: I tend to agree with you that this should be added as a
breaking change to the current master CHANGES.txt and should be consulted
when people pull a new release. We cannot add this to the release artifacts
however. We would need to roll a new release (1.15.1). If you feel that
this is enough of a reason to roll a new release (which I do not) then
please go ahead and do so.

This is a lesson learned and I can honestly say that it was the result of
us trying to make the upgrade as clean as possible without leaving too much
of the deprecated MR API still around. Maybe this could have however been
phased out across several releases...

Lewis

On Tue, Sep 4, 2018 at 8:53 AM  wrote:

>
> user Digest 4 Sep 2018 15:53:01 - Issue 2929
>
> Topics (messages 34147 through 34147)
>
> IndexWriter interface in 1.15
> 34147 by: Yossi Tamari
>
>
>
>
>
> -- Forwarded message --
> From: Yossi Tamari 
> To: 
> Cc:
> Bcc:
> Date: Tue, 4 Sep 2018 18:52:54 +0300
> Subject: IndexWriter interface in 1.15
> Hi,
>
>
>
> I missed it at the time, but I just realized (the hard way) that the
> IndexWriter interface was changed in 1.15 in ways that are not backward
> compatible.
>
> That means that any custom IndexWriter implementation will no longer
> compile, and probably will not run either.
>
> I think this was a mistake (maybe a new interface should have been created,
> and the old one deprecated and supported for now, or just the old methods
> deprecated without change, and the new methods provided with a default
> implementation), but it's too late now.
>
> I still think this is something that should be highlighted in the release
> note for 1.15 (meaning at the top, as "breaking changes").
>
> The main changes I encountered:
>
> 1.  setConf and getConf were removed from the interface (without
> deprecation).
> 2.  open was deprecated (that's fine), and its signature was changed
> (from JobConf to Configuration), which means it a completely different
> function technically, and there is no point in the deprecation.
>
>
>
> Yossi.
>
>

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Nutch Maven support for plugins

2018-08-29 Thread lewis john mcgibbney
Hi Rustam,
There have been some efforts to Mavenize the entire build system. These all
died. If you look on JIRA you will see the relevant tickets for the most
recent implementation
https://issues.apache.org/jira/browse/NUTCH-2292
Our current build does not publish the Nutch plugins as Maven artifacts.
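
For what it's worth, the core artifact can be pulled as follows (the version
shown is illustrative):

  <dependency>
    <groupId>org.apache.nutch</groupId>
    <artifactId>nutch</artifactId>
    <version>1.15</version>
  </dependency>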
Lewis

On Wed, Aug 29, 2018 at 1:30 AM  wrote:

>
> From: Rustam 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 29 Aug 2018 10:29:45 +0200
> Subject: Nutch Maven support for plugins
> It seems Nutch is available in Maven, but without its plugins.
> Would it be possible to publish Nutch plugins in Maven as well?
> Without the plugins it's kind of useless.
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.

2018-08-16 Thread lewis john mcgibbney
Hi Puneet
Responses inline

On Wed, Aug 15, 2018 at 7:20 AM  wrote:

>
> From: Puneet Dhanda 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 15 Aug 2018 10:02:12 -0400
> Subject: Nutch 2.3.1 with Mongo datastore - No Document is getting indexed.
> Hi,
>
> I am using the Nutch- 2.3.1 with MongoDB as the datastore.


Are you using it from SCM or the release? If I were you I would use it from
SCM; we fixed a few bugs in there.


> While crawling
> the sites, getting the following error. Please assist what could be wrong
> here.
>
> Hadoop.log exception
> 2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - Retrying
> request
> 2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - I/O exception
> (java.net.ConnectException) caught when processing request: Connection
> refused (Connection refused)
> 2018-08-15 09:56:42,139 INFO  httpclient.HttpMethodDirector - Retrying
> request
> 2018-08-15 09:56:42,242 ERROR httpclient.Http - Failed with the following
> error:
> java.net.ConnectException: Connection refused (Connection refused)
> at java.net.PlainSocketImpl.socketConnect(Native Method)
> at
> java.net
> .AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
> at
> java.net
> .AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
> at
> java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
> at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
> 2018-08-15 09:56:46,409 INFO  fetcher.FetcherJob - 0/0 spinwaiting/active,
> 2 pages, 2 errors, 0.4 0 pages/s, 0 0 kb/s, 0 URLs in 0 queues
>

You may wish to use the parser checker tooling to ensure that you can reach
the 2 failed URLs without executing a full crawl
https://wiki.apache.org/nutch/bin/nutch%20parsechecker
Also, you can try setting DEBUG or TRACE logging for this tool, see
 https://github.com/apache/nutch/blob/2.x/conf/log4j.properties#L40
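
For example (the URL is a placeholder for one of your failing pages):

  bin/nutch parsechecker -dumpText http://example.com/page.html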
Lewis


[RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1

2018-08-07 Thread lewis john mcgibbney
Excellent. Thanks for taking on release manager Seb, it’s making a huge
impact. Nice work folks.

On Tue, Aug 7, 2018 at 05:37  wrote:

>
> user Digest 7 Aug 2018 12:37:25 - Issue 2921
>
> Topics (messages 34124 through 34124)
>
> [RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1
> 34124 by: Sebastian Nagel
>
>
>
>
>
> -- Forwarded message --
> From: Sebastian Nagel 
> To: user@nutch.apache.org
> Cc: d...@nutch.apache.org
> Bcc:
> Date: Tue, 7 Aug 2018 14:37:14 +0200
> Subject: [RESULT] was [VOTE] Release Apache Nutch 1.15 RC#1
> Hi Folks,
>
> thanks to everyone who was able to review the release candidate!
>
> 72 hours have passed, please see below for vote results.
>
> [4] +1 Release this package as Apache Nutch 1.15
>Roannel Fernández Hernández *
>Govind Nitk
>Markus Jelsma *
>Sebastian Nagel *
>
> [0] -1 Do not release this package because ...
>
> * Nutch PMC
>
> The VOTE passes with 3 binding votes from Nutch PMC members.
>
> I'll continue and publish the release packages. Tomorrow, after the
> packages have been propagated to all mirrors, I'll send the announcement.
>
> Thanks to everyone who has contributed to Nutch and the 1.15 release.
>
> Sebastian
>
> --
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: [ANNOUNCE] New Nutch committer and PMC - Omkar Reddy

2018-06-21 Thread lewis john mcgibbney
Excellent. Good job Omkar.

On Thu, Jun 21, 2018 at 1:18 AM,  wrote:

>
>
> From: Sebastian Nagel 
> To: d...@nutch.apache.org
> Cc: user@nutch.apache.org
> Bcc:
> Date: Thu, 21 Jun 2018 10:18:01 +0200
> Subject: [ANNOUNCE] New Nutch committer and PMC - Omkar Reddy
> Dear all,
>
> it is my pleasure to announce that Omkar Reddy has joined us
> as a committer and member of the Nutch PMC. Omkar has worked
> on upgrading Nutch to use the new MapReduce API as part of his
> Google Summer of Code project last year.
>
> Thanks, Omkar, and congratulations on your new role within the
> Apache Nutch community! And thanks for your contributions and
> efforts so far, hope to see more!
>
> Welcome on board!
>
> Sebastian
> (on behalf of the Nutch PMC)
>
>
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: No internet connection in Nutch crawler: Proxy configuration -PAC file

2018-04-23 Thread lewis john mcgibbney
Hi Patricia,
I've never used a proxy auto-config (PAC) method for proxying anything
before. The PAC is defined as "...Proxy auto-configuration (PAC): Specify
the URL for a PAC file with a JavaScript function that determines the
appropriate proxy for each URL. This method is more suitable for laptop
users who need several different proxy configurations, or complex corporate
setups with many different proxies."
Right now, the public guidance for using Nutch with a proxy goes as far as
the following tutorial
https://wiki.apache.org/nutch/SetupProxyForNutch
Right now, Nutch does not support the reading of PAC files... I think you
would need to add this functionality.
Lewis

On Sun, Apr 22, 2018 at 10:31 AM,  wrote:

>
> From: Patricia Helmich 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Fri, 20 Apr 2018 10:31:42 +
> Subject: No internet connection in Nutch crawler: Proxy configuration -PAC
> file
> Hi,
>
> I am using Nutch and it used to work fine. Now, some internet
> configurations changed and I have to use a proxy. In my browser, I specify
> the proxy by providing a PAC file to the option "Automatic proxy
> configuration URL". I was searching for a similar option in Nutch in the
> conf/nutch-default.xml file. I do find some proxy options (http.proxy.host,
> http.proxy.port, http.proxy.username, http.proxy.password,
> http.proxy.realm) but none seems to be the one I am searching for.
>
> So, my question is: where can I specify the PAC file in the Nutch
> configurations for the proxy?
>
> Thanks for your help,
>
> Patricia
>
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Nutch fetching times out at 3 hours, not sure why.

2018-04-18 Thread lewis john mcgibbney
Hi Chip,
Which version of Nutch are you using?

On Tue, Apr 17, 2018 at 7:45 AM,  wrote:

> From: Chip Calhoun 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Tue, 17 Apr 2018 14:45:01 +
> Subject: Nutch fetching times out at 3 hours, not sure why.
> I crawl a list of roughly 2600 URLs all on my local server, and I'm only
> crawling around 1000 of them. The fetcher quits after exactly 3 hours (give
> or take a few milliseconds) with this message in the log:
>
> 2018-04-13 15:50:48,885 INFO  fetcher.FetchItemQueues - * queue:
> https://history.aip.org >> dropping!
>
> I've seen that 3 hours is the default in some Nutch installations, but
> I've got my fetcher.timelimit.mins set to -1. I'm sure I'm missing
> something obvious. Any thoughts would be greatly appreciated. Thank you.
>
> Chip Calhoun
> Digital Archivist
> Niels Bohr Library & Archives
> American Institute of Physics
> One Physics Ellipse
> College Park, MD  20740-3840  USA
> Tel: +1 301-209-3180
> Email: ccalh...@aip.org
> https://www.aip.org/history-programs/niels-bohr-library
>
>
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: any23 2.2 upgrading in NUTCH gives errors

2018-04-02 Thread lewis john mcgibbney
Hi Govind,

Please scope out https://github.com/apache/nutch/pull/306
Let me know how things go.
Lewis

On Mon, Apr 2, 2018 at 4:45 AM,  wrote:

>
>
> From: govind nitk 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Mon, 2 Apr 2018 17:15:38 +0530
> Subject: any23 2.2 upgrading in NUTCH gives errors
> Hi,
>
> Tried to upgrade any23 2.1 to 2.2 in nutch code base.
>
> Changes:
> 1. src/plugin/any23/ivy.xml:
>  conf="*->default">
>
> 2. src/plugin/any23/plugin.xml
>
> 
> 
> 
> 
> 
>
>
> after "ant runtime",
> below jar files are present in dir runtime/local/plugins/any23
>
> any23.jar
> apache-any23-api-2.2.jar
> apache-any23-core-2.2.jar
> apache-any23-csvutils-2.2.jar
> apache-any23-encoding-2.2.jar
> apache-any23-mime-2.2.jar
>
>
>
>
> Did simple parse checker on a test html. Getting Errors as
> 1.  java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
> org/eclipse/rdf4j/common/lang/service/ServiceRegistry
>  
> Caused by: java.lang.NoClassDefFoundError: org/eclipse/rdf4j/common/lang/
> service/ServiceRegistry
>
> 2. java.util.concurrent.ExecutionException: java.lang.NoClassDefFoundError:
> org/apache/any23/extractor/ExtractorRegistryImpl
> ...
> Caused by: java.lang.NoClassDefFoundError: org/apache/any23/extractor/
> ExtractorRegistryImpl
>
>
>
>
>
>


Re: index-metadata, lowercasing field names?

2018-03-07 Thread lewis john mcgibbney
Patch it Markus.
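
For reference, a sketch of the one-line change under discussion, i.e.
preserving the original field-name case rather than lowercasing (based on the
line quoted below):

  parseFieldnames.put(metatag, metatag);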

On Wed, Mar 7, 2018 at 1:58 PM,  wrote:

>
> From: Markus Jelsma 
> To: User 
> Cc:
> Bcc:
> Date: Wed, 7 Mar 2018 11:24:09 +
> Subject: index-metadata, lowercasing field names?
> Hi,
>
> I've got metadata, containing a capital in the field name. But
> index-metadata lowercases its field names:
>   parseFieldnames.put(metatag.toLowerCase(Locale.ROOT), metatag);
>
> This means index-metadata is useless if your metadata fields contain
> uppercase characters. Was this done for a reason?
>
> If not, i'll patch it up.
>
>


Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

2018-02-12 Thread lewis john mcgibbney
Hi David,
The java.lang.NoClassDefFoundError issues could be resolved simply by
including the correct Jar artifacts.
 We will have the issue resolved correctly very soon and I will let you
know when Any23 2.2 is released.
Lewis

On Sat, Feb 10, 2018 at 11:42 AM, <user-digest-h...@nutch.apache.org> wrote:

> From: David Ferrero <david.ferr...@zion.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Sat, 10 Feb 2018 12:41:57 -0700
> Subject: Re: NUTCH-1129, Any23, microdata parsing, indexing, and
> extraction?
> Awesome on the forthcoming Any23 2.2 release. I look forward to it and the
> subsequent bump to Nutch.
>
> In the meantime, I was successful to build Any23 from master, then copy
> the any23 jars into Nutch (master) then reference them in the plugin…
> 
> 
> 
> 
> 
>
> Unfortunately when I reran the nutch parsechecker it failed to parse
> anymore. A quick look at the logs/hadoop.log reveal that updated any23
> depends on new classes in the other jar files:
> Caused by: java.lang.NoClassDefFoundError: org/apache/commons/rdf/api/IRI
> Caused by: java.lang.NoClassDefFoundError: Could not initialize class
> org.semanticweb.owlapi.rio.OWLAPIRDFFormat
> java.util.concurrent.ExecutionException: java.lang.NoSuchMethodError:
> org.jsoup.select.NodeTraversor.traverse(Lorg/
> jsoup/select/NodeVisitor;Lorg/jsoup/nodes/Node;)V
>
> I guess I would need to rebuild nutch from master (rather than just copy a
> few jar files) and ensure that any23’s jar dependencies are also referenced.
>
> > On Feb 9, 2018, at 1:45 PM, Lewis John McGibbney <lewi...@apache.org>
> wrote:
> >
> > Hi David,
> > We are in the process of releasing Any23 2.2, this will include the fix.
> > We can then come back to Nutch and make the upgrade and you should be
> all set.
> > Hopefully this will be achieved within around 72hrs. In the meantime,
> you can clone, build and deploy Any23 master. This will do the trick.
> > Lewis
> >
> > On 2018/02/09 07:31:10, David Ferrero <david.ferr...@zion.com> wrote:
> >> Thank you for this information. Since this is very much related to
> Any23 and microdata parsing, I’m going to ask what I believe is a related
> question but keep this same thread so it will be organized in one place:
> >>
> >> I noticed a lot of job boards such as dice.com <http://dice.com/>,
> monster.com <http://monster.com/>, etc use http://schema.org/JobPosting <
> http://schema.org/JobPosting> information, however many seem to use
> 

Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

2018-02-09 Thread Lewis John McGibbney
Hi David,
We are in the process of releasing Any23 2.2, this will include the fix.
We can then come back to Nutch and make the upgrade and you should be all set.
Hopefully this will be achieved within around 72hrs. In the meantime, you can 
clone, build and deploy Any23 master. This will do the trick.
Lewis

On 2018/02/09 07:31:10, David Ferrero <david.ferr...@zion.com> wrote: 
> Thank you for this information. Since this is very much related to Any23 and 
> microdata parsing, I’m going to ask what I believe is a related question 
> but keep this same thread so it will be organized in one place:
> 
> I noticed a lot of job boards such as dice.com, monster.com, etc. use
> http://schema.org/JobPosting information, however many seem to use
> <script type="application/ld+json">… rather than RDF.
> Summer 2017, Google announced structured data guidance for Jobs:
> https://developers.google.com/search/docs/data-types/job-posting 
> and a testing tool to validate your HTML: 
> https://search.google.com/structured-data/testing-tool
> I verified a few sample listings on the above mentioned job boards on 
> google’s testing-tool and they validate OK.
> 
> So after looking at http://any23.apache.org/getting-started.html 
> for the supported extractors, 
> I see Any23 mentions it supports JSON+LD input, so I added this to 
> nutch-site.xml to override the same property in nutch-default.xml:
> 
> <property>
>   <name>any23.extractors</name>
>   <value>html-microdata,html-embedded-jsonld,rdf-jsonld</value>
>   <description>Comma-separated list of Any23 extractors (a list of
>   extractors is available here: http://any23.apache.org/getting-started.html)</description>
> </property>
> 
> I expected to see additional information from nutch parsechecker after adding 
> the jsonld extractors, however I see NO changes to Any23-Triples microdata 
> parsed. 
> 
> What might I be doing wrong?
> 
> > On Feb 8, 2018, at 11:17 AM, lewis john mcgibbney <lewi...@apache.org> 
> > wrote:
> > 
> > Hi David,
> > Answers inline
> > 
> > On Thu, Feb 8, 2018 at 9:19 AM, <user-digest-h...@nutch.apache.org> wrote:
> > 
> >> 
> >> From: David Ferrero <david.ferr...@zion.com>
> >> To: user@nutch.apache.org
> >> Cc:
> >> Bcc:
> >> Date: Thu, 8 Feb 2018 10:19:52 -0700
> >> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
> >> Pull request #205 was recently merged into master branch for Nutch 1.x in
> >> fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
> >> 
> >> I am new to nutch and solr and have just started crawling and indexing a
> >> few select websites. Using the built in html parsing/indexing, I am getting
> >> searchable fields like url, content, host, sometimes a title, and a few
> >> other indexing related fields like digest, boost, segment, and tstamp. That
> >> said, I realized very quickly that I need better results. While exploring
> >> the source of the website, I noticed references to schema.org and get
> >> excited by what I see. That’s how I stumbled upon NUTCH-1129.
> >> 
> >> I’ve built apache-nutch-1.15-SNAPSHOT which includes Any23 
> >> parser/indexer.
> >> 
> > 
> > Excellent.
> > 
> > 
> >> 
> >> Q: Now what?  How do I gain Any23 microdata parsing / indexing
> >> capabilities introduced by NUTCH-1129?
> >> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
> >> plugin.includes with something like parse-(html | tika |
> >> any23)|index-(basic | anchor | any23)
> >> 
> > 
> > No, you just add 'any23' to the list of plugins within the plugin.includes
> > property of nutch-site.xml
> > 
> > 
> >> Q: How do I expose the discovered microdata structure / items to end-user
> >> such as Solr? For example, what are the microdata items and do I need to
> >> map them to Solr in solrindex-mapping.xml?
> >> 
> > 
> > OK, so current configuration for the Any23 plugin, is to store extracted
> > structured data markup in the Nutch Metadata object with a key "
> > Any23-Triples". You can locate it using something like the ParserChecker
> > tool provided via the 'nutch' script. Likewise you can also locate it, as a
> > representation of what would be indexed, by using the IndexerChecker
> > tooling also provided within the 'nutch' script.
> > 
> > An example

Re: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?

2018-02-08 Thread lewis john mcgibbney
Hi David,
Answers inline

On Thu, Feb 8, 2018 at 9:19 AM,  wrote:

>
> From: David Ferrero 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Thu, 8 Feb 2018 10:19:52 -0700
> Subject: NUTCH-1129, Any23, microdata parsing, indexing, and extraction?
> Pull request #205 was recently merged into master branch for Nutch 1.x in
> fulfillment of NUTCH-1129 "microdata for Nutch 1.x"
>
> I am new to nutch and solr and have just started crawling and indexing a
> few select websites. Using the built in html parsing/indexing, I am getting
> searchable fields like url, content, host, sometimes a title, and a few
> other indexing related fields like digest, boost, segment, and tstamp. That
> said, I realized very quickly that I need better results. While exploring
> the source of the website, I noticed references to schema.org and get
> excited by what I see. That’s how I stumbled upon NUTCH-1129.
>
> I’ve built apache-nutch-1.15-SNAPSHOT which includes Any23 parser/indexer.
>

Excellent.


>
> Q: Now what?  How do I gain Any23 microdata parsing / indexing
> capabilities introduced by NUTCH-1129?
> Q: Do I replace parse-(html | tika)|index-(basic | anchor) in
> plugin.includes with something like parse-(html | tika |
> any23)|index-(basic | anchor | any23)
>

No, you just add 'any23' to the list of plugins within the plugin.includes
property of nutch-site.xml
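
For example, a sketch of the property (the other plugins listed are
illustrative; keep whatever your configuration already uses):

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|any23|index-(basic|anchor)</value>
  </property>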


> Q: How do I expose the discovered microdata structure / items to end-user
> such as Solr? For example, what are the microdata items and do I need to
> map them to Solr in solrindex-mapping.xml?
>

OK, so the current configuration for the Any23 plugin is to store extracted
structured data markup in the Nutch Metadata object with the key
"Any23-Triples". You can locate it using something like the ParserChecker
tool provided via the 'nutch' script. Likewise you can also locate it, as a
representation of what would be indexed, by using the IndexerChecker
tooling also provided within the 'nutch' script.
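
For instance (using the same URL as the sample output below):

  bin/nutch parsechecker -dumpText https://smartive.ch/jobs
  bin/nutch indexchecker https://smartive.ch/jobs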

An example of how the data is indexed would be as follows (example
after crawling https://smartive.ch/jobs):


  "structured_data": [
{
  "node": "",
  "value": "\"IE-edge,chrome=1\"@de",
  "key": "",
  "short_key": "X-UA-Compatible"
},
{
  "node": "",
  "value": "\"Wir sind smartive \\u2014 eine dynamische,
innovative Schweizer Webentwicklungsagentur. Die Realisierung
zeitgem\\u00E4sser Webl\\u00F6sungen geh\\u00F6rt genauso zu unserer
Passion, wie die konstruktive Zusammenarbeit mit unseren Kundinnen und
Kunden.\"@de",
  "key": "",
  "short_key": "description"
},
{
  "node": "",
  "value": "\"width=device-width, initial-scale=1,
shrink-to-fit=no\"@de",
  "key": "",
  "short_key": "viewport"
},
{
  "node": "",
  "value": "\"width=device-width,initial-scale=1\"@de",
  "key": "",
  "short_key": "viewport"
},
{
  "node": "",
  "value": "\"ie=edge\"@de",
  "key": "",
  "short_key": "x-ua-compatible"
}
  ],


Note from above, that the 'predicate' key field is very useful for quickly
filtering through, for example, Hotel Ratings, or something similar.


>
> I’d also be interested to learn how to point at a specific URL and see how
> nutch sees the microdata (best case), then learn how to leverage this into
> nutch and finally into solr.
>
>
See the tooling for ParserChecker and IndexerChecker as explained above.
Any further question, please let me know.
Lewis


Re: Can I use protocol-selenium with https?

2018-01-15 Thread lewis john mcgibbney
Hi Sheon,
It looks like HTTPS is not currently supported
https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java#L63
I can't recall if we were ever successful in adapting the plugin for HTTPS
so I can't advise further.
Lewis

On Mon, Jan 15, 2018 at 1:13 AM,  wrote:

>
> From: sheon banks 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Mon, 15 Jan 2018 02:31:15 +
> Subject: Can I use protocol-selenium with https?
> Hi all,
>
>
> I would like to crawl an https website using protocol-selenium. Can I do
> this?  If so,  can someone provide the configuration steps.
>
>
> Sent from Outlook
>
>


-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc


Re: Getting Error

2018-01-11 Thread lewis john mcgibbney
I unfortunately do not use the OpenJDK so I don't know if this is where
your issue stems from.
All of your config looks absolutely fine.
Lewis

On Thu, Jan 11, 2018 at 8:26 AM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: govind nitk <govind.n...@gmail.com>
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 10 Jan 2018 14:06:53 +0530
> Subject: Re: Getting Error
> $java -version
> openjdk version "1.8.0_141"
> OpenJDK Runtime Environment (build 1.8.0_141-8u141-b15-3~14.04-b15)
> OpenJDK 64-Bit Server VM (build 25.141-b15, mixed mode)
>
>
> config edits:
>
> Gora properties:
> gora.datastore.default=org.apache.gora.mongodb.store.MongoStore
> gora.mongodb.override_hadoop_configuration=false
> gora.mongodb.mapping.file=/gora-mongodb-mapping.xml
> gora.mongodb.servers=localhost:27017
> gora.mongodb.db=crawler
> #gora.mongodb.login=login
> #gora.mongodb.secret=secret
>
>
> nutch-site.xml:
> <property>
>   <name>storage.data.store.class</name>
>   <value>org.apache.gora.mongodb.store.MongoStore</value>
>   <description>Default class for storing data</description>
> </property>
>
>
> mongod running on default port: 27017.
>
>
> And before generating the snapshot, uncommented the gora backend to use mongo,
> as:
>  conf="*->default" />
>
>
> Am I missing anything else?
>
>
> regards,
> govind
>
>
>
> On Wed, Jan 10, 2018 at 12:31 PM, govind nitk <govind.n...@gmail.com>
> wrote:
>
> >
> > hi Lewis,
> >
> > uname -a: Linux data 4.4.0-108-generic #131~14.04.1-Ubuntu SMP Sun Jan 7
> > 15:54:10 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux
> >
> > On Tue, Jan 9, 2018 at 7:56 PM, lewis john mcgibbney <lewi...@apache.org
> >
> > wrote:
> >
> >> Hi govind,
> >> Very strange. Which operating system are you using?
> >> Lewis
> >>
> >> On Tue, Jan 9, 2018 at 5:15 AM, <user-digest-h...@nutch.apache.org>
> >> wrote:
> >>
> >> > From: govind nitk <govind.n...@gmail.com>
> >> > To: user@nutch.apache.org
> >> > Cc:
> >> > Bcc:
> >> > Date: Tue, 9 Jan 2018 15:45:08 +0530
> >> > Subject: Getting Error
> >> > Hi,
> >> >
> >> > 1. running nutch compiled from branch 2.x. Build succeeded.
> >> > 2. using mongo as db storage. changed the storage.data.store.class to
> >> point
> >> > to mongo class.
> >> >
> >> >
> >> > Getting this error while running nutch inject /tmp/urls/seeds.txt ?
> >> >
> >> >
> >> > Error: A JNI error has occurred, please check your installation and
> try
> >> > again
> >> > Exception in thread "main" java.lang.VerifyError: Bad type on operand
> >> stack
> >> > Exception Details:
> >> >   Location:
> >> > org/apache/nutch/crawl/InjectorJob.run(Ljava/util/Map;)
> >> Ljava/util/Map;
> >> > @85: putfield
> >> >   Reason:
> >> > Type 'org/apache/nutch/util/NutchJob' (current frame, stack[1])
> is
> >> not
> >> > assignable to 'org/apache/hadoop/mapreduce/Job'
> >> >   Current Frame:
> >> > bci: @85
> >> > flags: { }
> >> > locals: { 'org/apache/nutch/crawl/InjectorJob', 'java/util/Map',
> >> > 'org/apache/hadoop/fs/Path', 'java/lang/Object' }
> >> > stack: { 'org/apache/nutch/crawl/InjectorJob',
> >> > 'org/apache/nutch/util/NutchJob' }
> >> >   Bytecode:
> >> > 0x000: 2ab6 0004 1205 b800 06b6 0007 2b12 09b9
> >> > 0x010: 000a 0200 4e2d c100 0b99 000b 2dc0 000b
> >> > 0x020: 4da7 000f bb00 0b59 2db6 000c b700 0d4d
> >> > 0x030: 2a04 b500 0e2a 03b5 000f 2a2a b600 04bb
> >> > 0x040: 0010 59b7 0011 1212 b600 132c b600 14b6
> >> > 0x050: 0015 b800 16b5 0017 2ab4 0017 2cb8 0018
> >> > 0x060: 2ab4 0017 1219 b600 1a2a b400 1712 1bb6
> >> > 0x070: 001c 2ab4 0017 121d b600 1e2a b400 1712
> >> > 0x080: 1fb6 0020 2ab4 0017 b600 2112 1b12 1db8
> >> > 0x090: 0022 3a04 2ab4 0017 1904 04b8 0023 2ab4
> >> > 0x0a0: 0017 b600 21b8 0024 3a05 b200 25bb 0010
> >> > 0x0b0: 59b7 0011 1226 b600 1319 05b6 0014 1227
> >> > 0x0c0: b600 13b6 0015 b900 2802 002a b400 1712
> >> > 0x0d0: 29b6 002a 2ab4 0017 03b6 002b 2ab4 0017
> >> > 0x0e0: 04b6 002c 5701 2ab4 0017 2ab4 002d b8

Re: Getting Error

2018-01-09 Thread lewis john mcgibbney
Hi govind,
Very strange. Which operating system are you using?
Lewis

On Tue, Jan 9, 2018 at 5:15 AM,  wrote:

> From: govind nitk 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 9 Jan 2018 15:45:08 +0530
> Subject: Getting Error
> Hi,
>
> 1. running nutch compiled from branch 2.x. Build succeeded.
> 2. using mongo as db storage. changed the storage.data.store.class to point
> to mongo class.
>
>
> Getting this error while running nutch inject /tmp/urls/seeds.txt ?
>
>
> Error: A JNI error has occurred, please check your installation and try
> again
> Exception in thread "main" java.lang.VerifyError: Bad type on operand stack
> Exception Details:
>   Location:
> org/apache/nutch/crawl/InjectorJob.run(Ljava/util/Map;)Ljava/util/Map;
> @85: putfield
>   Reason:
> Type 'org/apache/nutch/util/NutchJob' (current frame, stack[1]) is not
> assignable to 'org/apache/hadoop/mapreduce/Job'
>   Current Frame:
> bci: @85
> flags: { }
> locals: { 'org/apache/nutch/crawl/InjectorJob', 'java/util/Map',
> 'org/apache/hadoop/fs/Path', 'java/lang/Object' }
> stack: { 'org/apache/nutch/crawl/InjectorJob',
> 'org/apache/nutch/util/NutchJob' }
>   Bytecode:
> 0x000: 2ab6 0004 1205 b800 06b6 0007 2b12 09b9
> 0x010: 000a 0200 4e2d c100 0b99 000b 2dc0 000b
> 0x020: 4da7 000f bb00 0b59 2db6 000c b700 0d4d
> 0x030: 2a04 b500 0e2a 03b5 000f 2a2a b600 04bb
> 0x040: 0010 59b7 0011 1212 b600 132c b600 14b6
> 0x050: 0015 b800 16b5 0017 2ab4 0017 2cb8 0018
> 0x060: 2ab4 0017 1219 b600 1a2a b400 1712 1bb6
> 0x070: 001c 2ab4 0017 121d b600 1e2a b400 1712
> 0x080: 1fb6 0020 2ab4 0017 b600 2112 1b12 1db8
> 0x090: 0022 3a04 2ab4 0017 1904 04b8 0023 2ab4
> 0x0a0: 0017 b600 21b8 0024 3a05 b200 25bb 0010
> 0x0b0: 59b7 0011 1226 b600 1319 05b6 0014 1227
> 0x0c0: b600 13b6 0015 b900 2802 002a b400 1712
> 0x0d0: 29b6 002a 2ab4 0017 03b6 002b 2ab4 0017
> 0x0e0: 04b6 002c 5701 2ab4 0017 2ab4 002d b800
> 0x0f0: 2e2a b400 17b6 002f 1230 1231 b600 32b9
> 0x100: 0033 0100 3706 2ab4 0017 b600 2f12 3012
> 0x110: 34b6 0032 b900 3301 0037 08b2 0025 bb00
> 0x120: 1059 b700 1112 35b6 0013 1608 b600 36b6
> 0x130: 0015 b900 2802 00b2 0025 bb00 1059 b700
> 0x140: 1112 37b6 0013 1606 b600 36b6 0015 b900
> 0x150: 2802 002a b400 2db0
>   Stackmap Table:
> append_frame(@36,Top,Object[#148])
> full_frame(@48,{Object[#149],Object[#150],Object[#151],
> Object[#148]},{})
>
> at java.lang.Class.getDeclaredMethods0(Native Method)
> at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
> at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
> at java.lang.Class.getMethod0(Class.java:3018)
> at java.lang.Class.getMethod(Class.java:1784)
> at sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
> at sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
>
>
>
> Regards,
> govind
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: upgrading Selenium is causing errors

2018-01-03 Thread lewis john mcgibbney
Hi Sheon,
Assuming that you are using Nutch master branch, please read
https://github.com/apache/nutch/blob/master/src/plugin/lib-selenium/howto_upgrade_selenium.txt

https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium

Make the relevant dependency updates and repackage the Nutch source.
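
As a sketch (the exact files to touch are listed in the howto above; note
that the jar names in the plugin's plugin.xml usually need updating
alongside the ivy.xml versions):

# after editing the dependency versions, rebuild the runtime
ant clean runtime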
Lewis

On Wed, Jan 3, 2018 at 9:26 AM,  wrote:

>
> From: sheon banks 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Wed, 3 Jan 2018 09:26:06 +
> Subject: upgrading Selenium is causing errors
> I have upgrade the selenium and htmlutil libraries with the latest version
> of seleniumn(3.8.1).  When I run a fetch I recieve the following error...
>
>
> Fetcher: throughput threshold retries: 5
> FetcherThread 41 fetch of https://localhost/ failed with: 
> java.lang.NoClassDefFoundError:
> Could not initialize class org.openqa.selenium.json.Json
> FetcherThread 41 has no more work available
> FetcherThread 41 -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=0
>
> Can someone help me let me know what I am doing wrong?
>
> shena
>
>
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Nutch 2.x does not send index to ElasticSearch 2.3.3

2017-12-26 Thread lewis john mcgibbney
Hi Devil,
Do your logs indicate any issues?
Lewis

On Mon, Dec 25, 2017 at 5:41 PM,  wrote:

>
> -- Forwarded message --
> From: devil devil 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Fri, 22 Dec 2017 21:24:51 +0100
> Subject: Nutch 2.x does not send index to ElasticSearch 2.3.3
> Hello,
> I am running nutch 2.x and elasticsearch 2.3.3 in two containers. I
> can log into nutch container and curl E.S. so connectivity is there.
> Inject/Fetch/etc all work fine. However when i get to nutch index
> elasticsearch, all i get is:
>
> root@b211135e1be5:~/nutch/bin# ./nutch index elasticsearch -all
> IndexingJob: starting
> Active IndexWriters :
> ElasticIndexWriter
>  elastic.cluster : elastic prefix cluster
> elastic.host : hostname
> elastic.port : port  (default 9300)
> elastic.index : elastic index command
> elastic.max.bulk.docs : elastic bulk index doc counts. (default
> 250)
> elastic.max.bulk.size : elastic bulk index length. (default
> 2500500 ~2.5MB)
>
>I tried various E.S. versions and various combinations of settings, but
> still getting nowhere.
>My elasticsearch.conf is empty (should I have something here?)
>Below is my nutch-site.xml (I was using indexer-elastic before but was
> getting the "No indexwriters found" errors. Then I saw there is
> indexer-elastic2 plugin)
>
>


[ANNOUNCE] Apache Gora 0.8 Release

2017-09-20 Thread lewis john mcgibbney
Hi Folks,

The Apache Gora team are pleased to announce the immediate availability of
Apache Gora 0.8.

The Apache Gora open source framework provides an in-memory data model and
persistence for big data. Gora supports persisting to

   - column stores,
   - key value stores,
   - document stores,
   - distributed in-memory key/value stores,
   - in-memory data grids,
   - in-memory caches,
   - distributed multi-model stores, and
   - hybrid in-memory architectures

Gora also enables analysis of data with extensive Apache Hadoop™ MapReduce
and Apache Spark™ support. Gora uses the Apache Software License v2.0.

Gora is released as both source code, downloads for which can be found at
our downloads page [0] as well as Maven artifacts which can be found on
Maven central [1].
The DOAP file for Gora can be found here [2]

This release addresses a modest 35 issues with the addition of new
datastore for OrientDB and Aerospike. The full Jira release report can be
found here [3].

Suggested Gora database support is as follows


   - Apache Avro  1.8.1
   - Apache Hadoop  2.5.2
   - Apache HBase  1.2.3
   - Apache Cassandra  3.11.0 (Datastax Java
   Driver 3.3.0)
   - Apache Solr  6.5.1
   - MongoDB  (driver) 3.5.0
   - Apache Accumulo  1.7.1
   - Apache Spark  1.4.1
   - Apache CouchDB  1.4.2 (test containers
    1.1.0)
   - Amazon DynamoDB  (driver) 1.10.55
   - Infinispan  7.2.5.Final
   - JCache  1.0.0 with Hazelcast
    3.6.4 support.
   - OrientDB  2.2.22
   - Aerospike  4.0.6


Thank you

Lewis

(on behalf of Gora PMC)

[0] http://gora.apache.org/downloads.html
[1] http://search.maven.org/#search|ga|1|g%3A%22org.apache.gora%22
[2] https://svn.apache.org/repos/asf/gora/committers/doap_Gora.rdf
[3] https://s.apache.org/3YdY

--
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Request for Review

2017-09-06 Thread lewis john mcgibbney
Hi user@ and dev@,

As part of the Nutch Google Summer of Code effort this year, Omkar Reddy
and I have been working persistently throughout the summer months on the
Hadoop MapReduce API upgrade e.g. NUTCH-2375 Upgrade the code base from
org.apache.hadoop.mapred to org.apache.hadoop.mapreduce [0].
We believe we are now at a stage where this code is stable and should be
opened for widespread community review. It is a large patch, so the more
eyes we can get on this the better. Upgrading MapReduce API usage in Nutch
is long overdue so this will be a significant addition to the Nutch project.

The proposed pull request can be found at [1]. Please report any outcomes
back to the issue tracker at [1].

Thank you
Lewis

N.B. Please note that the official version of Apache Hadoop supported by
Nutch master branch at this time is 2.7.2.

[0] https://issues.apache.org/jira/projects/NUTCH/issues/NUTCH-2375
[1] https://github.com/apache/nutch/pull/188

-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: nutch server with different configs

2017-08-14 Thread lewis john mcgibbney
Hi Raziyeh,
Please see
https://wiki.apache.org/nutch/NutchRESTAPI#Configuration
Once you've created your new config, you can use it as follows
https://wiki.apache.org/nutch/NutchRESTAPI#Create_job
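
As a minimal sketch, assuming the endpoints documented on the wiki above
and a Nutch server on the default port 8081 (the config id and property
values are illustrative; the job args payload depends on the job type, see
the wiki):

curl -X POST -H 'Content-Type: application/json' \
  -d '{"configId":"config-1","params":{"fetcher.server.delay":"2.0"}}' \
  http://localhost:8081/config/create

curl -X POST -H 'Content-Type: application/json' \
  -d '{"crawlId":"crawl-1","type":"INJECT","confId":"config-1","args":{}}' \
  http://localhost:8081/job/create
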
Lewis

On Fri, Aug 11, 2017 at 12:23 AM,  wrote:

>
> From: Raziyeh Farjamfard 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Thu, 10 Aug 2017 15:30:28 +0430
> Subject: nutch server with different configs
> Hello everybody,
>
>
> I want to run one nutch server and use multiple config directory. I want to
> use each of these directories while running a job, for example I want to
> run inject job-1 with configDir-1 and run inject job-2 with configDir-2.
>
>
> I read this post
> https://stackoverflow.com/questions/9673315/is-there-a-
> way-to-run-nutch-with-different-configuration-files,
> but my question is how could i set these directory paths? I use nutch rest
> and I don’t know how could set this.
>
>
>


Re: I'm just going to throw this out there...

2017-08-14 Thread lewis john mcgibbney
Hi Ray,
Apart from not being able to find a tutorial, what is wrong exactly?
New users of Nutch are advised to use the Nutch 1.X series.
The Nutch 2.X tutorial introduces more moving parts. This is well
documented on this mailing list for a number of years now.
If you can enumerate what is wrong, we will help you out.
Thanks
Lewis

On Sun, Aug 13, 2017 at 8:49 PM,  wrote:

>
> From: Ray Crawford 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Sun, 13 Aug 2017 23:48:59 -0400
> Subject: I'm just going to throw this out there...
> And it may get me banned, but so be it.
>
> I've been trying to get a Nutch/Solr setup running and, after many hours of
> cruising StackOverflow, this list and many documentation sites which talked
> about various versions, I've got nothing to show for it.
>
> Why is this so complex and why is a reasonable set of documentation about
> how to integrate the solutions so hard to find?
>
> Can anyone point me to an ACCURATE Nutch 2.3/Solr tutorial? If someone
> can help me here, I'll write a Chef cookbook that automates the whole
> thing.  However, I can't get any of the tutorials I've tried so far to
> work.
>
> Thanks and hopefully the community will help me (and others) work through
> this or absolve me of my apparent ignorance.
>
> - Ray.
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: nutch 1.x tutorial with solr 6.6.0

2017-07-12 Thread lewis john mcgibbney
Hi Folks,
I just updated the tutorial below; if you find any discrepancies, please
let me know.

https://wiki.apache.org/nutch/NutchTutorial

Also, I have made available a new schema.xml which is compatible with Solr
6.6.0 at

https://issues.apache.org/jira/browse/NUTCH-2400

Please scope it out and let me know what happens.
Thank you
Lewis

On Wed, Jul 12, 2017 at 6:58 AM,  wrote:

>
> From: Pau Paches [mailto:sp.exstream.t...@gmail.com]
> Sent: Tuesday, July 11, 2017 2:50 PM
> To: user@nutch.apache.org
> Subject: [EXTERNAL] Re: nutch 1.x tutorial with solr 6.6.0
>
> Hi Rashmi,
> I have followed your suggestions.
> Now I'm seeing a different error.
> bin/nutch solrindex http://127.0.0.1:8983/solr crawl/crawld -linkdb
> crawl/linkdb crawl/segments The input path at segments is not a segment...
> skipping
> Indexer: starting at 2017-07-11 20:45:56
> Indexer: deleting gone documents: false


...


Re: nutch 1.x tutorial with solr 6.6.0

2017-07-09 Thread lewis john mcgibbney
Hi Pau,

On Sat, Jul 8, 2017 at 6:52 AM,  wrote:

> From: Pau Paches 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Sat, 8 Jul 2017 15:52:46 +0200
> Subject: nutch 1.x tutorial with solr 6.6.0
> Hi,
> I have run the Nutch 1.x Tutorial with Solr 6.6.0.
> Many things do not work,


What does not work? Can you elaborate?


> there is a mismatch between the assumed Solr
> version and the current Solr version.
>

We support Solr as an indexing backend in the broadest sense possible. We
do not aim to support the latest and greatest Solr version available. If
you are interested in upgrading to a particular version, if you could open
a JIRA issue and provide a pull request it would be excellent.


> I have seen some messages about the same problem for Solr 4.x
> Is this the right path to go or should I move to Nutch 2.x?


If you are new to Nutch, I would highly advise that you stick with 1.X


> Does it
> make sense to use Solr 6.6 with Nutch 1.x?


Yes... you _may_ have a few configuration options to tweak but there have
been no backwards incompatibility issues so I see no reason for anything to
be broken.


> If yes, I'm willing to
> amend the tutorial if someone helps.
>
>
What is broken? Can you elaborate?


Re: Custom Plugin Resources Files

2017-06-29 Thread lewis john mcgibbney
Hi Dave,
Does this need to be done in parsing phase? Parsing is already an IO
intensive process... could you possible do it at another phase?
Right now, the only plugin I can think of which ships with Nutch source,
and which consults an external resource (not packaged with Nutch) is the
index-geoip plugin [0]. This works in distributed mode.
Please also consider looking into the parsefilter-naivebayes [1] which
loads in a prebuilt model [2] as a resource that is then used for the
filtering.
hth
Lewis

[0] https://github.com/apache/nutch/tree/master/src/plugin/index-geoip
[1]
https://github.com/apache/nutch/tree/master/src/plugin/parsefilter-naivebayes
[2]
https://github.com/apache/nutch/blob/master/src/plugin/parsefilter-naivebayes/src/java/org/apache/nutch/parsefilter/naivebayes/NaiveBayesParseFilter.java#L132-L137

On Thu, Jun 29, 2017 at 8:29 AM,  wrote:

>
>
> From: SJC Multimedia 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Thu, 29 Jun 2017 08:28:54 -0700
> Subject: Custom Plugin Resources Files
> I am building a custom plugin in Nutch 2.3.1 on Hadoop/HBase. In the plugin
> code, I need to pull in a dictionary of files and run some comparisons
> while parsing the document.
>
> Is there a way to include directory of files through the custom plugin ant
> build framework that will work on both local and cluster(hadoop MR) mode?
>
> Any pointers will be helpful.
>
> Thanks
> Dave
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: ERROR: Cannot run job worker!

2017-06-24 Thread lewis john mcgibbney
Hi Vyacheslav,
Thanks for the update, can you please open a ticket at
https://issues.apache.org/jira/projects/NUTCH
If you are able to submit a pull request at https://github.com/apache/nutch/,
it would be appreciated.
Lewis

On Sat, Jun 24, 2017 at 9:36 AM,  wrote:

>
> From: Vyacheslav Pascarel 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Fri, 23 Jun 2017 13:07:39 +
> Subject: RE: [EXTERNAL] - Re: ERROR: Cannot run job worker!
> Hi Lewis,
>
> I think I narrowed the problem down to the SelectorEntryComparator class
> nested in GeneratorJob. In the debugger during the crash I noticed a single
> instance of SelectorEntryComparator shared across multiple reducer tasks.
> The class is
> inherited from org.apache.hadoop.io.WritableComparator that has a few
> members unprotected for concurrent usage. At some point multiple threads
> may access those members in WritableComparator.compare call. I modified
> SelectorEntryComparator and it seems solved the problem but I am not sure
> if the change is appropriate and/or sufficient (covers GENERATE only?)
>
> Original code:
> 
>
>   public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
>   }
>
> Modified code:
> 
>   public static class SelectorEntryComparator extends WritableComparator {
> public SelectorEntryComparator() {
>   super(SelectorEntry.class, true);
> }
>
>     @Override
>     // serialize access: WritableComparator keeps internal deserialization
>     // buffers that are not safe for concurrent use by multiple reducers
>     synchronized public int compare(byte[] b1, int s1, int l1, byte[] b2,
>         int s2, int l2) {
>       return super.compare(b1, s1, l1, b2, s2, l2);
>     }
>   }
>
>


Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-22 Thread lewis john mcgibbney
Hi Vyacheslav,
Can you provide me an example page with the http refresh tag included? I'll
try comparing behaviour between 2.X and master.
Thank you
Lewis

On Sat, Jun 17, 2017 at 9:25 AM,  wrote:

> From: Vyacheslav Pascarel 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Fri, 16 Jun 2017 13:18:16 +
> Subject: RE: [EXTERNAL] - Re: Outlinks field is not populated when page
> from seed URL when fetched page contains "refresh" meta tag
> It is 2.3.1.
>
>


Re: ERROR: Cannot run job worker!

2017-06-21 Thread lewis john mcgibbney
Hi Vyacheslav,

Which version of Nutch are you using? 2.x?
lewis

On Wed, Jun 21, 2017 at 10:32 AM,  wrote:

>
>
> From: Vyacheslav Pascarel 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Wed, 21 Jun 2017 17:32:15 +
> Subject: ERROR: Cannot run job worker!
> Hello,
>
> I am writing an application that performs web site crawling using Nutch
> REST services. The application:
>
>
>


Re: Outlinks field is not populated when page from seed URL when fetched page contains "refresh" meta tag

2017-06-15 Thread lewis john mcgibbney
Hi Vyacheslav,

On Thu, Jun 15, 2017 at 1:41 AM,  wrote:

>
> From: Vyacheslav Pascarel 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 22:15:49 +
> Subject: Outlinks field is not populated when page from seed URL when
> fetched page contains "refresh" meta tag
> Hello,
>
> I am trying to crawl http://www.msnbc.com/ but having problem to get
> anything else beside the original seed URL. The INJECT/GENERATE/FETCH steps
> complete without problems but after executing PARSE I see only one outlink
> pointing to the original seed URL:
>
> ...
Which version of Nutch are you using?
Lewis


Re: Optimize Nutch Indexing Speed

2017-06-15 Thread lewis john mcgibbney
Hi Dennis,

On Thu, Jun 15, 2017 at 1:41 AM,  wrote:

>
> From: Dennis A 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Wed, 14 Jun 2017 20:45:35 +0200
> Subject: Re: Optimize Nutch Indexing Speed
> Hi Lewis,
> thank you for your suggestions!
>

No problems at all.

...

> My current
> investigations point me to the fact that the temporary folder has only
> about 4.5GB of disk space remaining, which might be the reason for a
> collapse, since I managed to estimate the size on at least 2.5-3GB for the
> smaller configuration.
> I plan to move this to another folder where more disk space is remaining.
>

Please also consider the 'hadoop.tmp.dir' configuration parameter... this
should be set to a path where there is enough disk space... Hadoop
intermediate data structures reside locally on disk, and you need to
accommodate this.


>
> What I could sadly not find is the option to increase the number of
> mappers/reducers for the tasks. I deduced (seemingly correctly) that the
> actual hadoop-site.xml and mapred-site.xml configurations can (or more:
> have) be done in the nutch-site.xml file?
>

Yes they can... but please make sure they are not also overridden within
the nutch script
https://github.com/apache/nutch/blob/master/src/bin/crawl#L116-L118
Please investigate and adapt for your particular environment.
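
As a sketch of the kind of override this allows (the property names are the
Hadoop 2 MapReduce ones; paths and values are illustrative):

# more reducers for the db update, and a roomier temp dir
bin/nutch updatedb -D mapreduce.job.reduces=8 \
  -D hadoop.tmp.dir=/data/hadoop-tmp \
  crawl/crawldb crawl/segments/20170609123456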


> My problem now is: For the fetching and generation step, the machine seems
> to utilize many cores in parallel, and htop does show me multiple threads,
> probably the Hadoop mappers.
>

Please see above, please also look at the following

https://github.com/apache/nutch/blob/master/src/bin/crawl#L120-L122

as well as anything else 'fetch'-related within that script.


>
> Yet, for the parsing step (which is now the longest part with around 1h), I
> only notice one major thread. Since I do already notice multiple threads
> for the former, I am unsure whether this can be parallelized in the local
> execution mode, or whether this is only possible for
> pseudo-distributed/distributed mode.
> Do the linked properties possibly resolve this problem, too? Or would this
> only further increase the number of executors for the fetch/parse steps?
>

Please investigate the nutch script... it will give you a body of insight
into the crawl cycle as well as the limitations. Nutch is a batch oriented
system.. it has limitations. Once you understand them, you can mitigate to
a certain extent or leverage them for your benefit.


>
> Sorry that I ask, but I do not yet have so much experience with crawling at
> all :/
>
>
No problem at all.
Lewis


Re: Optimize Nutch Indexing Speed

2017-06-14 Thread lewis john mcgibbney
Hi Dennis,

On Sun, Jun 11, 2017 at 2:45 AM,  wrote:

>
> From: Dennis A 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Fri, 9 Jun 2017 09:59:05 +0200
> Subject: Optimize Nutch Indexing Speed
> Hello,
> I have recently configured my Nutch crawler to index a whole domain, with
> an estimated number of 1.5M-3M documents.
>
> For this purpose, I wanted to use Nutch 1.13 and Solr 4.10.4 to build a
> search index over these documents. The compute server is a 16 core Xeon
> Server with 128GB RAM.
> While everything has worked for subdomain crawls quite well, I noticed some
> severe drawbacks once I put it on the whole domain:
> - The solr indexing failed without any obvious reason if I did not lower
> the -topN value to 40k instead of 50k documents.
>

Did this possibly fail on the SolrClean/Clean task instead of the indexing
task? If so, then you've encountered
https://issues.apache.org/jira/browse/NUTCH-2269. I would suggest you
upgrade to the master branch to work around this or else disable Clean for
the time being.


> - The CrawlDb and LinkDb merging steps take an unreasonably long amount of
> time after only 150k indexed documents (~7 crawl iterations). For the
> latest step, it took over 8 hours.


This is way too long. Have you tried profiling the tasks? How are you
running Nutch? Local, pseudo-distributed or distributed? I would look more
closely into your logs with DEBUG on to see what is going on. I would also
profile the task to see exactly where the tasks are struggling. Are you
filtering and normalizing? If so, do you have some complex rules in there
which may be decreasing performance?


> I noticed that it does seem to only
> utilize one core on the machine, which seems weird to me. I also already
> increased the Java Heap Size to 5GB (from default 1GB), but did not notice
> any imminent improvements.
>

Please check the following
https://stackoverflow.com/questions/8357296/full-utilization-of-all-cores-in-hadoop-pseudo-distributed-mode#8359416
See if any of this applies.


>
> My questions would be:
> - As an alternative to the server, I have access to a cluster of 4/5 nodes
> with 2 cores and 10 GB available for Hadoop. Would I benefit from a
> distributed run at all? It doesn't seem to me that the fetching/generating
> process is the bottleneck, but rather the (serial?) update of the database.
>

Generally speaking, yes, parallelizing the task will benefit you. Please
consider the above responses I've provided however before diving in with
this. Also note, it is possible to have overlapping crawls on the go even
on one machine.


> - Since crawling is not the issue, could I potentially benefit from
> switching to Nutch 2.x?
>

There is no reason why Nutch 1.X is not able to scale to this task. Your
dataset is not overly large by any means. I would stick with what you have
got and make an attempt to optimize configuration.


> - Is there any known reason that Solr might "reject" an indexing step, or
> was it just some temporary error? I have honestly not tried it again, since
> I have temporal limitations regarding the crawl, and do not want to have to
> start over again.
>

Understood. Please check it the 'clean' task killed it off for you. If so,
then please remove this from your crawl process.


> - Is there any way to efficiently "skip" the update steps for most of the
> time, and only perform them once a certain amount of pages have been
> acquired?


Yes, absolutely. It is not completely necessary to do this after every
crawl cycle.
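
A sketch of what that can look like with the individual commands (paths and
counts are illustrative; if you batch several generates before an update,
consider setting generate.update.crawldb to true so consecutive generates
do not re-select the same URLs):

for i in 1 2 3; do
  bin/nutch generate crawl/crawldb crawl/segments -topN 40000
  seg=$(ls -d crawl/segments/2* | tail -1)   # newest segment
  bin/nutch fetch "$seg"
  bin/nutch parse "$seg"
done
# fold all accumulated segments into the CrawlDb in one pass
bin/nutch updatedb crawl/crawldb crawl/segments/2*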


> Is it even normal that it take this long, or may I have some
> configuraitonal errors?
>

I think spending some time on the issue above should resolve your issues.

Lewis


Re: Many indexers

2017-06-14 Thread lewis john mcgibbney
Hi Roannel,
Markus worked on this quite a bit a while back. Please see
https://issues.apache.org/jira/browse/NUTCH-1480.
If you were able to pick this back up and update the patch with a pull
request I would happily review it, test it and provide feedback.

On Wed, Jun 14, 2017 at 7:42 AM,  wrote:

>
> From: "Roannel Fernández Hernández" 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Mon, 12 Jun 2017 10:28:42 -0400 (CDT)
> Subject: Many indexers
> Hi folks
>
> I'm using Nutch 1.12 and I have to send the all documents to different
> Solr servers (3 servers). Each Solr server is for different purposes, so
> the schemas isn't the same in each server. So I need to remove some fields
> before send it to a particular Solr server. How can I do that?
>
>


[ANNOUNCEMENT] Welcome Blackice as new Nutch PMC and Committer

2017-06-14 Thread lewis john mcgibbney
Hi Folks,
The Nutch PMC recently VOTE'd in Blackice to formally join our Nutch
Project Management Committee and as a Project Committer.
Please join me in offering a friendly welcome... not that he needs it. He's
been here for quite a while :)
@Blackice, feel free to say a bit about yourself if you want to.
Thanks,
Lewis
(On behalf of the Nutch PMC)

-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


RE: What up with 2.3.1 ?

2017-06-05 Thread lewis john mcgibbney
Forwarding with correct thread name.

-- Forwarded message --
From: lewis john mcgibbney <lewi...@apache.org>
Date: Mon, Jun 5, 2017 at 2:50 PM
Subject: Re: user Digest 3 Jun 2017 19:27:20 - Issue 2758
To: "user@nutch.apache.org" <user@nutch.apache.org>


Hi Ed,
Disappointing to hear that this really got under your skin... never nice to
hear that frustration becomes the outcome rather than successfully running
the software. I've provided comments below

On Sat, Jun 3, 2017 at 12:27 PM, <user-digest-h...@nutch.apache.org> wrote:

>
> From: Edward Capriolo <edlinuxg...@gmail.com>
> To: "user@nutch.apache.org" <user@nutch.apache.org>
> Cc:
> Bcc:
> Date: Sat, 3 Jun 2017 15:27:06 -0400
> Subject: What up with 2.3.1 ?
>
> Nutch 2.3.1, I have to say, I do not even understand it as a release.
>

This could be understood... as a previous (historical) user of the Nutch
1.X series... you seem to have prior expectations which are/were based on a
simplified technology stack. Nutch 2.X is aimed at using a different stack
and focuses on use of more modern storage solutions as you've found out. It
has never really been touted as the go-to Nutch branch... you will notice
that Nutch 1.X is the mainstream (master) branch. You'll also see, that
over a number of years, the message has been consistent... Nutch 1.X is the
go-to software both for users of source and release artifacts.


>
> First, I attempted to ...



If you want to use Nutch 2.3.1 with HBase, you should use the backend
datastore support which ships with the release announcement. That is as
follows

Apache Avro 1.7.6
Apache Hadoop 1.2.1 and 2.5.2
Apache HBase 0.98.8-hadoop2 (although also tested with 1.X)
Apache Cassandra 2.0.2
Apache Solr 4.10.3
MongoDB 2.6.X
Apache Accumulo 1.5.1
Apache Spark 1.4.1

I've tried my best, alongside several others over at the Gora community, to
ensure all of these datastores are documented over at
http://gora.apache.org/current/index.html#gora-modules.
It should be noted that since then, Gora master branch contains datastore
version upgrades for nearly every datastore.


>
>
>
> I just do not get the entire 2.3.1 release. It is very frustrating.


Yes, as I said this is disappointing to see that you struggled so much with
this. I've tried to make best efforts to ensure our Nutch2 tutorial is
up-to-date
https://wiki.apache.org/nutch/Nutch2Tutorial


> The
> webui's tend to fire blank pages with no stack traces.


Please feel free to log issues... if it is broken then we can try to fix
it. Without some Jira issue or debug information then we don't know it is
broken.


> Its unclear why
> backends that do not work are even documented.


HBase is most widely used, followed by MongoDB... on the other end of the
spectrum, Cassandra is least used and broken. It has not been maintained
for quite some time... and yes this is reflected by use of Super Columns.
We are currently re-writing the backend as part of a GSoC project.


> How can even the file/avro
> support not even work?
>

Please log your issue(s) in Jira and I can try to reproduce it using 2.x
branch. I have not used this backend when I have deployed 2.X. I was not
aware that it was broken.
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: user Digest 17 Apr 2017 22:31:08 -0000 Issue 2738

2017-04-17 Thread lewis john mcgibbney
Hi Yongyao,
The code in question is found below
https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/Generator.java#L230-L232
A few things come to mind here...
 * are you sure the entries with a lower score than the minimum threshold
were not present before you established the threshold configuration?
 * have you rebuilt the Nutch source code after establishing the
configuration such that the desired configuration is available to the Nutch
deployment?
Lewis

On Mon, Apr 17, 2017 at 3:31 PM,  wrote:

>
> From: Yongyao Jiang 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Mon, 17 Apr 2017 18:31:05 -0400
> Subject: Why "generate.min.score" does not work?
> Hi,
>
> I am using scoring-similarity plugin. After setting the generate.min.score
> to 0.05, and indexing all the pages (with its score) into Elastic, I can
> still observe many web pages whose scores are below 0.05.
>
> <property>
>   <name>generate.min.score</name>
>   <value>0.05</value>
>   <description>Select only entries with a score larger than
>   generate.min.score.</description>
> </property>
>
> Below is the result of a simple aggregation of "score" in ES,
> {
>"key": "20170417215917",
>"doc_count": 200,
>"Stats": {
>   "count": 200,
>   "min": 0,
>   "max": 0.019184709,
>   "avg": 0.001282872445002,
>   "sum": 0.256574489
>}
> }
>
> Thanks,
> Yongyao
>
>


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Nutch Plugins Source Control

2017-04-07 Thread lewis john mcgibbney
Hi Folks,

Maven build is actually pretty close now. We need to bring the following
branch up-to-date with 1.14 then stabilize tests... then it is good to
propose as a PR for 1.14-SNAPSHOT.
Transferring this work over to 2.x will be much easier than the work done
for master branch.
I'm over on JIRA discussing this on the ticket.
Lewis

On Fri, Apr 7, 2017 at 2:24 PM,  wrote:

>
> From: Chris Mattmann 
> To: "user@nutch.apache.org" 
> Cc:
> Bcc:
> Date: Fri, 07 Apr 2017 10:03:46 -0700
> Subject: Re: Nutch Plugins Source Control
> Thanks Julien.
>
> We do intend to publish the artifacts to Central, so they should be
> available in the org.apache tree.
>
> Thamme, Lewis, any update on the Mavenization?
>
>


[ANNOUNCE] Apache Nutch 1.13 Release

2017-04-02 Thread lewis john mcgibbney
Hello Folks,

The Apache Nutch [0] Project Management Committee are pleased to announce
the immediate release of Apache Nutch v1.13, we advise all current users
and developers of the 1.X series to upgrade to this release.

Nutch is a well matured, production ready Web crawler. Nutch 1.x enables
fine grained configuration, relying on Apache Hadoop™
 data structures, which are great for batch
processing.

The Nutch DOAP can be found at [1]. An account of the CHANGES in this
release can be seen in the release report.

As usual in the 1.X series, release artifacts are made available as both
source and binary and also within Maven Central as a Maven dependency. The
release is available from our DOWNLOADS PAGE.
Thank you
Lewis
(On behalf of the Nutch PMC)

[0] http://nutch.apache.org
[1] https://svn.apache.org/repos/asf/nutch/cms_site/trunk/content/doap.rdf
-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.13 RC#1

2017-04-02 Thread lewis john mcgibbney
Hi Folks,
Thank you to everyone who was able to review the RC and VOTE, greatly
appreciated.
72 hours have come and gone, please see below for the RESULTs.

[9] +1 Release this package as Apache Nutch 1.13.
Lewis John McGibbney *
Julien Nioche *
Kevin Ratnasekera
Chris A. Mattmann *
Furkan KAMACI *
Matei Miroslav
Markus Jelsma *
Jorge Luis Betancourt González *
Sebastian Nagel *

[0] -1 Do not release this package because…

* Nutch PMC

The VOTE passes. Thank you to everyone able to contribute towards the Nutch
1.13 release.
Lewis

On Tue, Mar 28, 2017 at 9:20 PM, lewis john mcgibbney <lewi...@apache.org>
wrote:

> Hi Folks,
>
> A first candidate for the Nutch 1.13 release is available at:
>
>   https://dist.apache.org/repos/dist/dev/nutch/1.13/
>
> The release candidate is a zip and tar.gz archive of the binary and
> sources in:
> https://github.com/apache/nutch/tree/release-1.13
>
> The SHA1 checksum of the archive is
> bd0da3569aa14105799ed39204d4f0a31c77b42c
>
> In addition, a staged maven repository is available here:
>
> https://repository.apache.org/content/repositories/orgapachenutch-1013
>
> We addressed 29 Issues - https://s.apache.org/wq3x
>
> Please vote on releasing this package as Apache Nutch 1.13.
> The vote is open for the next 72 hours and passes if a majority of at
> least three +1 Nutch PMC votes are cast.
>
> [ ] +1 Release this package as Apache Nutch 1.13.
> [ ] -1 Do not release this package because…
>
> Cheers,
> Lewis
> (On behalf of the Nutch PMC)
>
> P.S. Here is my +1.
>
> --
> http://home.apache.org/~lewismc/
> @hectorMcSpector
> http://www.linkedin.com/in/lmcgibbney
>



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: How does scoring chain work

2017-03-29 Thread lewis john mcgibbney
Hi Yongyao,

In addition to Seb's response, please also check out the
'scoring.filter.order' property in nutch-site.xml
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1429-L1437
This will determine the order and provide you with more control over
complex scoring logic.
Lewis

On Wed, Mar 29, 2017 at 6:16 AM,  wrote:

>
> From: Yongyao Jiang 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 28 Mar 2017 16:48:38 -0400
> Subject: How does scoring chain work
> Hi,
>
> I got a question about how the scoring works when I was trying to use
> multiple scoring plugins together.
>
> For example, if I use "scoring-(opic|similarity)", the opic score of a page
> is 0.2, and the similarity score is 0.5, what would be the final score? Is
> there anyway to configure this?
>
> Thanks,
> Yongyao
>
>


[VOTE] Release Apache Nutch 1.13 RC#1

2017-03-28 Thread lewis john mcgibbney
Hi Folks,

A first candidate for the Nutch 1.13 release is available at:

  https://dist.apache.org/repos/dist/dev/nutch/1.13/

The release candidate is a zip and tar.gz archive of the binary and sources
in:
https://github.com/apache/nutch/tree/release-1.13

The SHA1 checksum of the archive is
bd0da3569aa14105799ed39204d4f0a31c77b42c

In addition, a staged maven repository is available here:

https://repository.apache.org/content/repositories/orgapachenutch-1013

We addressed 29 Issues - https://s.apache.org/wq3x

Please vote on releasing this package as Apache Nutch 1.13.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.13.
[ ] -1 Do not release this package because…

Cheers,
Lewis
(On behalf of the Nutch PMC)

P.S. Here is my +1.

-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: How to configure Apache gora to take only ol as column family ?

2017-03-16 Thread lewis john mcgibbney
Hi suyash,

This issue can be addressed by essentially commenting out all of the
instances where the WebPage [0] object is augmented within each job (and
possibly plugin).
An example would be as follows
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/parse/ParseUtil.java#L358
You need to step through the entire codebase and essentially comment out
setting (and maybe getting) values from the WebPage object.
The alternative option is to simply create a new WebPage schema with only
the outlinks data structure, then use the 'ant generate-gora-src' target to
recompile the WebPage class.
https://github.com/apache/nutch/blob/2.x/build.xml#L612-L623
You can then attempt to recompile the project and address each compile
error sequentially until all you have remaining is code pertaining to
outlinks.
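
A sketch of that second route (the schema file below is the 2.x default
location; keep a backup before trimming it):

# edit src/gora/webpage.avsc down to the fields you need (e.g. outlinks),
# then regenerate the WebPage class and rebuild
ant generate-gora-src
ant runtime
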
hth
Lewis

[0]
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/storage/WebPage.java

On Thu, Mar 16, 2017 at 2:45 AM,  wrote:

>
> From: suyash singh 
> To: user@nutch.apache.org
> Cc:
> Bcc:
> Date: Tue, 14 Mar 2017 01:30:49 +0530
> Subject: Re: extract elements from each url as json and write it to s3
> Hi,
> I think you have to take database like mongodb. Write your custom gora
> mongodb mapping.xml and pass your Jason object to this.
>
> Thanks,
> suyash
>
>


RE: Nutch2 - What are exactly the steps to execute?

2016-11-21 Thread lewis john mcgibbney
Hi Daniele,
In short, if I were you I would look into using the readdb resource
https://wiki.apache.org/nutch/bin/nutch%20readdb
This will enable you to take a peek into your MongoDB table and find out
which documents are present. By the looks of it from your Gist nothing is
being fetched and therefore no outlinks are being parsed out... however I
may be wrong. You can check using the readdb resource as above.
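
For example (2.x syntax; the output directory is illustrative, and -crawlId
is only needed if you set one):

# summary statistics for the whole table
bin/nutch readdb -stats
# dump everything, including content, to a local directory
bin/nutch readdb -dump /tmp/readdb-out -content
# or inspect a single URL
bin/nutch readdb -url http://example.com/
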
hth

On Sat, Nov 19, 2016 at 8:09 AM,  wrote:

> From: Daniele Cremonini 
> To: 
> Cc:
> Date: Fri, 18 Nov 2016 15:28:49 +0100 (CET)
> Subject: Nutch2 - What are exactly the steps to execute?
> Hello,
>
> I installed and configured Nutch2 with MongoDB and Elasticsearch.
>
> I’m pretty convinced that the configuration is correct but I don’t see how
> to invoke Nutch.
>
> In this page : https://wiki.apache.org/nutch/NutchTutorial there are I
> think enough details to call Nutch 1.x
> but in this page : https://wiki.apache.org/nutch/Nutch2Tutorial the Invoke
> chapter is pretty poor.
>
> What I did :
>
> bin/nutch inject /apps/nutch-urls/
> bin/nutch generate -topN 40
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb -all
> bin/nutch index -all
>
> but Nutch never tries to index data I know because I enriched the logging
> activity of ElasticIndexWriter a little bit.
>
> May anybody give me some ideas?
>
>


Re: indexing to Solr

2016-11-21 Thread lewis john mcgibbney
Hi Michael,

On Sat, Nov 19, 2016 at 8:09 AM,  wrote:

> From: Michael Coffey 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Fri, 18 Nov 2016 21:15:14 + (UTC)
> Subject: indexing to Solr
> Where can I find up-to-date information on indexing to Solr?


http://wiki.apache.org/nutch/NutchTutorial
in particular
https://wiki.apache.org/nutch/NutchTutorial#Step-by-Step:_Indexing_into_Apache_Solr
If you find any issues with this tutorial then please let us know. Thank
you.


> When I search the web, I find tutorials that use the deprecated solrindex
> command. I also find questions where people want to know why it doesn't
> work.
>

That is because the only official documentation resides at
http://wiki.apache.org/nutch/NutchTutorial


> I have a good nutch 1.12 installation on a working hadoop cluster and a
> Solr 6.3.0 installation which works for their gettingstarted example.
>

You should use the specified version of Solr for the Nutch release. This is
Solr 5.4.1 as defined in the indexer-solr plugin ivy.xml


> I have questions like: Do I need to create a core and a collection in solr?


Yes I would. This is explained at
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


> Do I need http or cloud type server? Do I need solr.zookeeper.url?
>

This is not a Nutch question. This is your preferred Solr configuration. If
you are just starting out then I would say it is not a big deal...
experiment and go with what works best for your requirements and resources
capacity.


> What else needs to be set in nutch-site.xml?
>

Not much. For reference though, here are the Solr configuration options.
https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1750-L1826


> What about schema?
>

This is covered in
https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search


>
> Thanks for all the help so far!
>
>
No problems. Any more issues, ping us here and we will help.
Ta


Re: how to insert nutch into ambari ecosystem ?

2016-11-15 Thread lewis john mcgibbney
Hi Eyeris,
Replies inline

On Fri, Oct 28, 2016 at 8:51 PM,  wrote:

> From: Eyeris Rodriguez Rueda 
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 28 Oct 2016 09:43:59 -0400 (CDT)
> Subject: how to insert nutch into ambari ecosystem ?
> Hi all.
> I have installed the ambari ecosystem and its services are running OK
> (accumulo, yarn, zookeeper and others).
>

Good.


> My environment is a short cluster with 8 servers using ubuntu Server 14.04
> because ambari is not yet compatible with ubuntu server 16.04.
>

OK


> But i don't know how to insert nutch into ambari ecosystem to make crawl
> and also index with solr.
> Please any help or advice will be appreciated.
>
>
Well there are two parts to this.

One is us working over on the Ambari/BigTop platforms to ensure that the
relevant compatible packaging is created such that the option to build
Nutch with the Hadoop stack is shipped and available within Ambari. This is
probably a fair amount of work... but it would be useful, there is no
doubt about that.

The other is that when launching Hadoop clusters with Ambari and wishing to
run Nutch there, you can do so as you normally would. Just log into the
head node and launch your Nutch crawler in deploy mode... simple as that.
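
A sketch of what that looks like in practice (seed directory, crawl
directory and round count are illustrative):

# from the head node, using the job artifact built by 'ant runtime'
cd runtime/deploy
bin/crawl urls/ mycrawl 2
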
Any issues, let us know.
lewis


Re: user Digest 7 Nov 2016 19:53:09 -0000 Issue 2672

2016-11-15 Thread lewis john mcgibbney
Hi Eyeris,
I've just tried Nutch master branch to parse outlinks from a number of RSS
Feeds, an example being 'http://www.jpl.nasa.gov/blog/feed/'. This works
perfectly with both the feed and parse-tika plugins. Outlinks are extracted
accordingly.
Can you provide an example of the RSS Feeds you are looking to parse
outlinks from? Are they valid?
An excellent resource to use for this kind of troubleshooting is the
ParserChecker tool
https://wiki.apache.org/nutch/bin/nutch%20parsechecker
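
For example, to see which outlinks Nutch extracts from a feed without
running a crawl:

bin/nutch parsechecker http://www.jpl.nasa.gov/blog/feed/
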
hth


On Mon, Nov 7, 2016 at 11:53 AM,  wrote:

> From: Eyeris Rodriguez Rueda 
> To: user@nutch.apache.org
> Cc:
> Date: Sun, 6 Nov 2016 12:14:29 -0500 (CST)
> Subject: how to insert outlinks from rss in crawldb ?
>
> Hi.
> I am using nutch 1.12 and solr 4.10.3.
>
> Rss is a significant way to discover new url to fetch.
>
> All links detected in a rss are not inserted in crawldb as new urls.
> Can anybody tell me why?
> Please, can anybody help me or point me in the right direction to insert
> outlinks from feed into crawldb, and visit them in the next iteration.
>
> I have activated only tika parser because using both (tika and feed) the
> field content and outlinks are empty in solr.


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Nutch 2.3.1 REST calls to DB

2016-11-15 Thread lewis john mcgibbney
Hi Vladimir,
Responses inline

On Thu, Nov 10, 2016 at 1:05 AM,  wrote:

> From: Vladimir Loubenski 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Tue, 8 Nov 2016 17:53:59 +
> Subject: Nutch 2.3.1 REST calls to DB
> Hi,
> Nutch 2.x REST API documentation is mentioned  following syntax for DB
> calls  :https://wiki.apache.org/nutch/NutchRESTAPI
>
> 1. What does mean  "startKey", "endKey" and  "isKeysReversed" ?
> POST /db
>{
>   "startKey":"com.google",
>   "endKey":"com.yahoo",
>   "isKeysReversed":"true"
>}
>

Well essentially you are running a DB query here this is because we are
attempting to obtain data from one of the Gora supported databases. If you
wish to read the code then please see
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/resources/DbResource.java
In this case the startKey and endKey are what make up the 'DbFilter'
object. More on the particular object semantics can be seen at
https://github.com/apache/nutch/blob/2.x/src/java/org/apache/nutch/api/model/request/DbFilter.java
In this case you are setting a start key and an end key from which to scan
and for which to return a results Iterator. Please note that right now we
do not have consistency in the way that start keys or end keys are
inclusive or not within the Gora Query API.
Now on to the 'isKeysReversed' aspect of the JSON configuration. This one
relates to whether or not your keys represent a URL in its reversed form.
In both Nutch 1.X and 2.X, keys within the WebGraph DB are reversed due to
the improvements this offers us in terms of query and scan performance.
Let's take an example:

'org.apache.nutch...'

This means that we can scan initially for 'org' then 'apache' meaning that
we are scanning a significantly reduced subset of the data contained within
the WebGraph DB. On the other hand, let's consider the following:

'http://nutch.apache.org...'

This would mean that we query by 'http://', then 'nutch'

The issue with querying for 'http://' is that more or less EVERY key within
the DB would contain 'http://' meaning that our path to query is
significantly increased and our query is not going to be very efficient at
all.
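
Putting that together, a sketch of a /db query using reversed keys
(assuming a Nutch server on the default port 8081; the key values are
illustrative):

curl -X POST -H 'Content-Type: application/json' \
  -d '{"startKey":"org.apache","endKey":"org.apache.nutch","isKeysReversed":"true"}' \
  http://localhost:8081/db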



>
> Call bellow doesn't work for me. It always return empty result
> POST /db
>{
>   "batchId": "batch-id"
>}
>
>
Please ensure that you have replaced the right hand side "batch-id" value
with the value of one of your BatchID identifiers. These are created at the
generate phase of a crawl cycle. In order to obtain a list of all BatchID's
you've created, you would need to query your Database separately outside of
Nutch and create a list of BatchID results.
hth
Lewis


Re: How can I Score?

2016-11-15 Thread lewis john mcgibbney
Hi Michael,
Replies inline

On Sat, Nov 12, 2016 at 7:10 PM,  wrote:

> From: Michael Coffey 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Sun, 13 Nov 2016 03:07:16 + (UTC)
> Subject: How can I Score?
> When the generator is used with -topN, it is supposed to choose the
> highest-scoring urls.


Yes, this is the threshold of how many top-scoring URLs you wish to generate
into a new Fetch list and subsequently fetch. When you use the crawl
script, the -topN is calculated as follows

$numSlaves * 5

By default, we assume that you are running on one machine (local mode)
therefore the numSlaves variable is set to 1.


> In my case, all the urls in my db have a score of zero, except the ones
> injected.
>

This is a bit strange. I would not expect them to have absolutely zero...
are you sure that it is not marginally above zero? Which scoring
plugin/mechanism are you currently using?


> How can I cause scores to be computed and stored?


Scores for each and every CrawlDatum are computed automatically
out-of-the-box.
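
One quick way to verify this (the crawldb path is illustrative) is the
CrawlDb statistics report, which includes the minimum, average and maximum
scores:

bin/nutch readdb crawl/crawldb -stats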


> I am using the standard crawl script.


OK


> Do I need to enable the various webgraph lines in the script?
>
>
Not unless you wish to use the WebGraph scoring implementation...
Lewis


-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney


Re: Nutch 2.3.1 elasticsearch tstamp

2016-10-21 Thread lewis john mcgibbney
Hi Joe,

On Fri, Oct 21, 2016 at 7:34 AM,  wrote:

> From: Joe Adams 
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 21 Oct 2016 10:34:15 -0400
> Subject: Nutch 2.3.1 elasticsearch tstamp
> I'm working on setting up nutch with elasticsearch and hbase to crawl a
> site and provide a dashboard in kibana for reporting. I have the
> interactions working between the components. I can crawl the site, hbase
> shows all the data, and I can index into elasticsearch. The problem is that
> the tstamp field in elasticsearch shows 1970-01-01T00:00:00.000Z and not
> data related to the fetched time of the page. I also tried adding the
> index-more plugin and that seems to add a 'date' field but this also shows
> up as epoch.
>
> I can't find much searching around the internet. The only thing I can find
> closely related is https://issues.apache.org/jira/browse/NUTCH-2045, but
> that was fixed in 2.3.1 which is the version I'm running.
>

My suggestion would be that if you are running Nutch2, then use the
current development branch which is available at
https://github.com/apache/nutch/tree/2.x. I say this as we are always
fixing bugs and it will enable other using this branch a better chance of
reproducing your issue. Additionally, this will enable you to upgrade to ES
2.X as per the indexer-elastic2 plugin
https://github.com/apache/nutch/tree/2.x/src/plugin/indexer-elastic2


>
> Does anyone have any idea why my dates aren't being set properly in my
> elasticsearch index?


Not yet but I will scope it out.


> The data looks good if I run readdb -url $url.


Thanks for this info.


> Can
> anyone provide some good advice to troubleshoot this further?
>

Not right now, but can you please log an issue over at Jira and also link
it to NUTCH-2045? This would help us to track it and fix it with a test if
there is definitely a bug.


>
> Any help would be appreciated.
>
>
> Versions:
> Nutch 2.3.1
> Elasticsearch 1.7.5
> Gora: 0.6.1
> Hbase: 1.2.3
>

Please note that the supported version of HBase in Nutch 2.3.1 is
0.98.8-hadoop2. I can say with near certainty that it will not be
compatible with HBase 1.2.3.


>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>.1</value>
>   <description>Delay between page fetches.</description>
> </property>
>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>.1</value>
> </property>
>

You may find that you experience access-denied responses, i.e. your IP being
blocked from accessing servers, when fetching with such small delay values.
This is just a friendly warning!
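For reference, the out-of-the-box value is a far politer five seconds, i.e.
something like:

<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>Delay between page fetches to the same server.</description>
</property>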

Please log the issue in Jira and I will try to reproduce.
Thanks
Lewis


Re: I think my hbase is broken

2016-10-21 Thread lewis john mcgibbney
Hi Tom,
Please post your entire Nutch log for the inject and generate phase if
possible. It is near impossible to debug given the information you've
provided.
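In the meantime, one sanity check you can run yourself from the HBase shell
(this assumes the default Nutch 2.x table name 'webpage'; a crawlId prefixes
the table name if you used one):

hbase shell
list
scan 'webpage', {LIMIT => 1}

If 'webpage' is missing from the list, the schema was never created, and
re-running the inject step should recreate it via Gora.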
Thanks

On Fri, Oct 21, 2016 at 7:34 AM,  wrote:

> From: Tom Chiverton 
> To: "user@nutch.apache.org" 
> Cc:
> Date: Thu, 20 Oct 2016 12:59:20 +0100
> Subject: I think my hbase is broken
>
> I'm using hbase with Nutch 2.3.1 and getting errors from the GeneratorJob
> step :
>
>
> GeneratorJob: java.io.IOException: Expecting at least one region.
>         at org.apache.gora.hbase.store.HBaseStore.getPartitions(HBaseStore.java:398)
>         at org.apache.gora.mapreduce.GoraInputFormat.getSplits(GoraInputFormat.java:94)
>
>
> I think this means hbase needs its Gora-based schema reapplying or
> something? How does one do that? Using a fresh hbase install doesn't seem
> to have helped.
>
>


Re: Nutch 2, Solr 5 - solrdedup causes ClassCastException:

2016-10-20 Thread lewis john mcgibbney
Hi Tom,
This looks like it has been frustrating for you, so I've provided a
walk-through of how I set up a core using the current Nutch 2.X schema.xml.

On Mon, Oct 17, 2016 at 9:27 AM,  wrote:

>
> From: Tom Chiverton 
> To: user@nutch.apache.org
> Cc:
> Date: Mon, 17 Oct 2016 09:55:53 +0100
> Subject: Re: Nutch 2, Solr 5 - solrdedup causes ClassCastException:
> I tried that, and it still gives
>
> ERROR: Error CREATEing SolrCore 'nutch': Unable to create core [nutch]
> Caused by: enablePositionIncrements is not a valid option as of Lucene 5.0
>
> Tom
>
>
lmcgibbn@LMC-056430 /usr/local/solr-6.2.1 $ cp
/usr/local/nutch2/conf/schema.xml example/files/conf/
lmcgibbn@LMC-056430 /usr/local/solr-6.2.1 $ ./bin/solr start
Waiting up to 30 seconds to see Solr running on port 8983 [/]
Started Solr server on port 8983 (pid=49222). Happy searching!

lmcgibbn@LMC-056430 /usr/local/solr-6.2.1 $ ./bin/solr create -c nutch -d
/usr/local/solr-6.2.1/example/files/conf -p 8983

Copying configuration to new core instance directory:
/usr/local/solr-6.2.1/server/solr/nutch

Creating new core 'nutch' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=nutch&instanceDir=nutch

{
  "responseHeader":{
"status":0,
"QTime":1657},
  "core":"nutch"}

I can now run my crawls on Nutch 2.X. Can you please replicate the above
and tell me if, and where, anything goes wrong?
Thanks
Lewis


Re: Nutch in production

2016-10-18 Thread lewis john mcgibbney
Hi Sachin,
Answering both of your questions here as I am catching up with some mail.

On Fri, Sep 30, 2016 at 5:04 AM,  wrote:

>
> From: Sachin Shaju 
> To: user@nutch.apache.org
> Cc:
> Date: Fri, 30 Sep 2016 10:00:04 +0530
> Subject: Re: Nutch in production
> Thank you guys for your replies. I will look into the suggestions you gave.
> But I have one more query. How can I trigger Nutch from a queue system in a
> distributed environment?


Well, this is a bit more tricky of course. As per my other mailing list
thread, you can easily use the REST API and the NutchServer for publishing
Nutch workflows, so I would advise you to look into that.
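For instance, a sketch of submitting a job over the REST API with curl (port
8081 is the NutchServer default as far as I recall, and the args payload is
an assumption whose exact keys vary by version):

curl -X POST http://localhost:8081/job/create \
  -H 'Content-Type: application/json' \
  -d '{"type": "INJECT", "confId": "default", "args": {"seedDir": "urls"}}'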


> Can the REST API be a real option in distributed mode?


As per my other thread... yes :) The one limitation is getting the injected
URLs into HDFS for use within the rest of the workflow.


> Or will I have to go for a command-line invocation of Nutch?
>
>
I think that we need to provide a patch for Nutch trunk to enable ingestion
of the injected seeds into HDFS via the REST API. Right now this
functionality is lacking. I've created a ticket for it at
https://issues.apache.org/jira/browse/NUTCH-2327

We will try to address this before the pending Nutch 1.13 release; however, I
cannot promise anything.
Thanks
Lewis

