Re: How to run nutch server on distributed environment

2016-10-18 Thread lewis john mcgibbney
Hi Sachin, Very late response I know but hopefully better late than never. Response below On Fri, Sep 30, 2016 at 5:04 AM, wrote: > > From: Sachin Shaju > To: user@nutch.apache.org > Cc: > Date: Thu, 29 Sep 2016 14:01:13 +0530 > Subject:

Re: Open Graph metadata?

2016-09-29 Thread lewis john mcgibbney
Hi Ralf, Do you mean here the Open Graph Protocol [0] markup? If so, then if it is present within then it is already parsed out and stored within Parse [1] and can be accessed via Parse.getData(). Please use the ParserChecker to double check this and if necessary post an example here so that I can be

Re: Arch 1.9.2 is available

2016-09-29 Thread lewis john mcgibbney
Cool... thanks for posting. On Wed, Sep 28, 2016 at 1:36 AM, wrote: > > user Digest 28 Sep 2016 08:36:56 - Issue 2648 > > Topics (messages 32792 through 32792) > > Arch 1.9.2 is available > 32792 by: Arkadi.Kosmynin.csiro.au > > Administrivia: > >

Re: Pull All URL List

2016-08-26 Thread lewis john mcgibbney
Hi Manish, On Fri, Aug 26, 2016 at 2:16 PM, wrote: > > From: Manish Verma > To: user@nutch.apache.org > Cc: > Date: Fri, 26 Aug 2016 14:16:49 -0700 > Subject: Pull All URL List > Hi, > > Using nutch 1.12 is there any way to get urls

Re: Application creating huge amount of logs : Nutch 2.3.1 + Hadoop 2.7.1

2016-08-26 Thread lewis john mcgibbney
Hi shubham.gupta, On Fri, Aug 26, 2016 at 2:16 PM, wrote: > > From: Markus Jelsma > To: "user@nutch.apache.org" > Cc: > Date: Thu, 25 Aug 2016 08:30:23 + > Subject: RE: Application creating huge amount

Re:HBaseStore WARN

2016-08-19 Thread lewis john mcgibbney
Hi Olle, The logging you see originates from the gora-hbase module and exact lines can be found at [0]. Essentially what happens here is that we make a WARN'ing between what is passed as a mapping and an Avro Schema name which can also be located at [1]. You should most likely modify both and

Re: Upgrade to Nutch 1.12

2016-08-19 Thread lewis john mcgibbney
Evening Madhvi, I will set this up and debug a clean. I'll report over on https://issues.apache.org/jira/browse/NUTCH-2269 Thank you for reporting. Lewis On Thu, Aug 18, 2016 at 7:08 AM, wrote: > > From: "Arora, Madhvi" > To:

Re: Indexing to remote Solr server

2016-07-20 Thread Lewis John Mcgibbney
Hi RRK, Check out the current configuration options for securing indexing in various plugins. This is currently plugin specific. The Solr configuration can be seen at https://github.com/apache/nutch/blob/master/conf/nutch-default.xml#L1800-L1808 Basically change this property to true, then pass -D

[ANNOUNCE] Apache Nutch 1.12 Release

2016-06-19 Thread lewis john mcgibbney
The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.12. We advise all current users and developers of the 1.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 1.x enables fine grained configuration, relying on Apache

Re: Newbie Question, hadoop error?

2016-06-15 Thread Lewis John Mcgibbney
Hi Sas, See response inline :) On Wed, Jun 15, 2016 at 5:36 AM, wrote: > From: "Jamal, Sarfaraz" > To: "'user@nutch.apache.org'" > Cc: > Date: Mon, 13 Jun 2016 17:36:44 -0400 > Subject:

Re: Indexing nutch crawled data in “Bluemix” solr

2016-06-14 Thread Lewis John Mcgibbney
Hi shakiba, On Sat, Jun 11, 2016 at 1:48 PM, wrote: > From: shakiba davari > To: user@nutch.apache.org > Cc: > Date: Thu, 9 Jun 2016 13:11:43 -0400 > Subject: Indexing nutch crawled data in “Bluemix” solr

Re: Webpage in HBase alternative name

2016-06-14 Thread Lewis John Mcgibbney
Hi Joe, On Mon, Jun 13, 2016 at 1:57 PM, wrote: > From: Joseph Obernberger > To: user@nutch.apache.org > Cc: > Date: Mon, 13 Jun 2016 10:39:05 -0400 > Subject: Re: Webpage in HBase alternative name > I see that the

Re: Crawldb

2016-06-14 Thread Lewis John Mcgibbney
Hi BlackIce, On Mon, Jun 13, 2016 at 1:57 PM, wrote: > From: BlackIce > To: user@nutch.apache.org > Cc: > Date: Mon, 13 Jun 2016 14:19:42 +0200 > Subject: Crawldb > I would like to "groom" the crawldb My guess is that it should be

Re: Robots.txt

2016-05-27 Thread Lewis John Mcgibbney
Hi Blackice, On top of what other folks have input. Over the years, we have enforced that Nutch be a good bot by default. For example the following default configuration parameter fetcher.server.delay 5.0 The number of seconds the fetcher will delay between successive requests to the same
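
For readers unfamiliar with the Nutch configuration format, the default quoted in the snippet above can be sketched as a property block. This is an illustration only: the name and value come from the snippet, the comment is a paraphrase, and placing an override in conf/nutch-site.xml (rather than editing nutch-default.xml) is the usual convention, assumed here.

```xml
<?xml version="1.0"?>
<!-- Sketch of an override file such as conf/nutch-site.xml (assumed location). -->
<configuration>
  <property>
    <!-- Politeness delay, in seconds, between successive requests to the same server. -->
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
</configuration>
```

Lowering the value makes the crawler more aggressive toward a single host, which works against the "good bot by default" intent described in the message.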

Re: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers

2016-05-20 Thread Lewis John Mcgibbney
Hi Joe, Please see the following line https://github.com/apache/nutch/blob/2.x/src/bin/crawl#L65 and also https://github.com/apache/nutch/blob/2.x/src/bin/crawl#L91 then https://github.com/apache/nutch/blob/2.x/src/bin/crawl#L158 reduce tasks are directly aligned with nodes/2 What kind of set up

Re: Newbie trouble - Hbase class not found

2016-05-16 Thread Lewis John Mcgibbney
Hi Diego, The PR at https://github.com/apache/nutch/pull/111 will solve your issue. Thanks On Mon, May 16, 2016 at 11:40 AM, wrote: > > From: diego gullo > To: user@nutch.apache.org > Cc: > Date: Sun, 15 May 2016 20:04:05 +0100 >

Nutch Docker Images Available on Dockerhub

2016-05-15 Thread Lewis John Mcgibbney
Hi Folks, I recently worked with Infra to make Docker images available for our community. https://hub.docker.com/r/apache/nutch/ Master branch will always be latest, with the Nutch 2.X Cassandra and HBase images also available. The initial builds are still underway so the images may not be

Re: Newbie trouble - Hbase class not found

2016-05-09 Thread Lewis John Mcgibbney
Hi Diego, On Mon, May 9, 2016 at 2:32 AM, wrote: > > From: diego gullo > To: user@nutch.apache.org > Cc: > Date: Sat, 7 May 2016 09:41:00 +0100 > Subject: Newbie trouble - Hbase class not found > I am trying Nutch for the first time. I

Re: Nutch 1.x crawl Zip file URLs

2016-05-09 Thread Lewis John Mcgibbney
Hi AL, The content is being truncated at some 524276020 Bytes. Increase or disable the http.content.limit https://github.com/apache/nutch/blob/trunk/conf/nutch-default.xml#L216-L224 On Mon, May 9, 2016 at 2:32 AM, wrote: > > From: A Laxmi
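
A minimal sketch of the change suggested above, assuming the content is fetched over HTTP and that overrides live in conf/nutch-site.xml. In the stock nutch-default.xml a negative value for this property means no truncation at all; a larger positive byte count would raise the limit instead of disabling it.

```xml
<?xml version="1.0"?>
<!-- Sketch of an override in conf/nutch-site.xml (assumed location). -->
<configuration>
  <property>
    <!-- Maximum number of bytes kept per HTTP download; -1 disables truncation. -->
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

Disabling the limit entirely can make very large archives expensive to fetch and parse, so a generous explicit limit may be the safer choice on a shared crawl.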

Re: Nutch 1.x crawl Zip file URLs

2016-05-05 Thread Lewis John Mcgibbney
Hi AL, Yes please see parse-zip plugin https://github.com/apache/nutch/tree/master/src/plugin/parse-zip You can register this within the plugin.includes property in nutch-site.xml Thanks On Thu, May 5, 2016 at 7:00 PM, wrote: > From: A Laxmi
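
As a rough illustration of registering the plugin, the fragment below adds zip to the parse group of plugin.includes. The rest of the value is an assumption modeled on a typical Nutch 1.x default and will differ between installations; the safer route is to copy your existing plugin.includes value and add only the parse-zip part.

```xml
<?xml version="1.0"?>
<!-- Sketch of an override in conf/nutch-site.xml (assumed location). -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Illustrative value: only the "zip" alternative in parse-(...) is the point here. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|zip)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```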

Nutch Presentation @ApacheCon Big Data

2016-05-04 Thread Lewis John Mcgibbney
Hi Folks, A heads up about my presentation @ApacheCon Big Data next week in Vancouver, BC. I will be giving a presentation titled "Experiences Using Apache HTrace (Incubating) in Distributed Web Search", the schedule can be found at http://sched.co/6M1a Thanks, look forward to seeing lots of people

Re: Visualization Tool for Nutch

2016-05-03 Thread Lewis John Mcgibbney
Assimilating with correct thread. -- Forwarded message -- From: Lewis John Mcgibbney <lewis.mcgibb...@gmail.com> Date: Tue, May 3, 2016 at 1:42 PM Subject: Re: user Digest 3 May 2016 14:53:20 - Issue 2582 To: "user@nutch.apache.org" <user@nutch.apache.org>

Re: user Digest 3 May 2016 14:53:20 -0000 Issue 2582

2016-05-03 Thread Lewis John Mcgibbney
Hi Bin, Hope you are doing well! Please see response below On Tue, May 3, 2016 at 7:53 AM, wrote: > > From: Bin Wang > To: "Apache.Nutch.User" > Cc: > Date: Mon, 2 May 2016 13:26:27 -0600 > Subject: Visualization

Re: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers

2016-05-03 Thread Lewis John Mcgibbney
Hi Joseph, On Tue, May 3, 2016 at 7:53 AM, wrote: > > From: Joseph Obernberger > To: user@nutch.apache.org > Cc: > Date: Tue, 3 May 2016 09:04:09 -0400 > Subject: Nutch 2.3.1 - Fetch Phase - Only 2 Reducers > Hello - I'm working

Re: [MASSMAIL]Re: Priorize links in Fetching Step

2016-05-03 Thread Lewis John Mcgibbney
Hi Yulio, On Tue, May 3, 2016 at 7:53 AM, wrote: > > From: Yulio Aleman Jimenez > To: user@nutch.apache.org > Cc: > Date: Sun, 1 May 2016 17:57:30 -0400 (CDT) > Subject: Re: [MASSMAIL]Re: Priorize links in Fetching Step > Hi Lewis. > > Thanks

Re: Priorize links in Fetching Step

2016-05-01 Thread Lewis John Mcgibbney
Hi Yulio, Marcus wrote the MimeAdaptiveFetchSchedule [0] implementation for exactly this purpose. You can utilize it as per [1] [0] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/crawl/MimeAdaptiveFetchSchedule.java [1]

Re: Solr as backend in nutch 2.3.1

2016-04-28 Thread Lewis John Mcgibbney
Hi tkg_cangkul, On Sat, Apr 23, 2016 at 9:26 AM, wrote: > > From: tkg_cangkul > To: user@nutch.apache.org > Cc: > Date: Fri, 22 Apr 2016 10:57:31 +0700 > Subject: Re: Solr as backend in nutch 2.3.1 > Hi Lewis, thx for your reply, > >

Re: Solr as backend in nutch 2.3.1

2016-04-21 Thread Lewis John Mcgibbney
Hi tkg_cangkul, Replies inline On Thu, Apr 21, 2016 at 2:29 AM, wrote: > From: tkg_cangkul > To: user@nutch.apache.org > Cc: > Date: Thu, 21 Apr 2016 16:29:03 +0700 > Subject: Solr as backend in nutch 2.3.1 > hi i try to use solr as

Re: Dump Command in Apache Nutch 2.x

2016-04-21 Thread Lewis John Mcgibbney
Hi Nana, Replies below On Thu, Apr 21, 2016 at 2:29 AM, wrote: > From: Nana Pandiawan > To: User Apache Nutch > Cc: > Date: Thu, 21 Apr 2016 10:45:03 +0700 > Subject: Dump Command in Apache Nutch 2.x >

Re: Plugin order not working

2016-04-21 Thread Lewis John Mcgibbney
Hi Harsh, On Thu, Apr 21, 2016 at 2:29 AM, wrote: > > From: harsh > To: user@nutch.apache.org > Cc: > Date: Wed, 20 Apr 2016 17:26:34 +0530 > Subject: Plugin order not working > Hi All > I am using nutch 2.3.1+gora+mongoDB .I have

Re: Adding a new field to Nutch + MongoDB datastore using plugin

2016-04-13 Thread Lewis John Mcgibbney
Hi jvence, Please see my reply below On Wed, Apr 13, 2016 at 8:26 AM, wrote: > > From: jvence > To: user@nutch.apache.org > Cc: > Date: Tue, 12 Apr 2016 10:17:20 -0700 (MST) > Subject: Adding a new field to Nutch + MongoDB datastore using

Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Enabling/configuring Nutch logging?

2016-04-13 Thread Lewis John Mcgibbney
Hi Kshitij, On Wed, Apr 13, 2016 at 5:36 AM, Kshitij Shukla wrote: > Thanks for your reply Lewis, > > Regarding your points: > 1) I am already using parameterized messaging convention. > From your line of Java code... you were not. You posted the following

Re: [CIS-CMMI-3] Enabling/configuring Nutch logging?

2016-04-11 Thread Lewis John Mcgibbney
Hi Kshitij, On Mon, Apr 11, 2016 at 8:12 AM, wrote: > > I am working on developing a plugin for nutch. I have added some code to > see the output either in console or in logs like this: > > *LOG.debug("Found keys :" + lcMetatag + "\t" + value);* > I would

Re: Index in storage-backend

2016-04-07 Thread Lewis John Mcgibbney
Hi Harsh, On Wed, Apr 6, 2016 at 7:14 AM, wrote: > > From: harsh > To: user@nutch.apache.org > Cc: > Date: Wed, 06 Apr 2016 15:43:05 +0530 > Subject: Index in storage-backend > Hi All > > While exploring nutch 2.3.1-gora-mongoDB I

Re: Apache Nutch : query

2016-04-07 Thread Lewis John Mcgibbney
Hi pesmadhu, Replies inline On Wed, Apr 6, 2016 at 7:14 AM, wrote: > > From: "pesmadhu ." > To: user@nutch.apache.org > Cc: > Date: Wed, 6 Apr 2016 15:12:28 +0530 > Subject: Apache Nutch : query > Hi, > >We have a requirement to scrape

Re: Get all the feed metadata

2016-03-30 Thread Lewis John Mcgibbney
Hi harsh, On Wed, Mar 30, 2016 at 5:18 AM, wrote: > From: harsh > To: user@nutch.apache.org > Cc: > Date: Tue, 29 Mar 2016 09:30:07 +0530 > Subject: Re: Get all the feed metadata > Hi Lewis > > seedurl.txt file is as follows > >

Re: [selenium] running selenium headless

2016-03-30 Thread Lewis John Mcgibbney
Hi Sabah, On Wed, Mar 30, 2016 at 5:18 AM, wrote: > Sent: Monday, March 28, 2016 8:07 PM > To: d...@nutch.apache.org > Subject: [selenium] running selenium headless > > > Hello, > > > I am new to nutch. I am trying to use the selenium plugin with nutch on a >

Re: Get all the feed metadata

2016-03-28 Thread Lewis John Mcgibbney
Hi harsh, Do you have an example URL? On Mon, Mar 28, 2016 at 2:59 AM, wrote: > > From: harsh > To: user@nutch.apache.org > Cc: > Date: Mon, 28 Mar 2016 15:29:26 +0530 > Subject: Get all the feed metadata > Hi All > I want to get all

Re: I have one small question that always intrigue me

2016-02-28 Thread Lewis John Mcgibbney
Hi Zara, I'm sorry but this question would be better asked over on the Solr lists http://lucene.apache.org/solr/resources.html#solr-user-list-solr-userlucene Lewis On Wed, Feb 24, 2016 at 12:39 PM, wrote: > From: Zara Parst > To:

Fwd: Query on fetcher.queue.mode property

2016-02-26 Thread Lewis John Mcgibbney
Forwarding on behalf of Manish @Manish, the agent@nutch list is used to report things like violations of the software use, etc. We will answer your question here on user@ Lewis -- Forwarded message -- From: Date: Fri, Feb 26, 2016 at 11:56 AM

Re: Inject command re-inject seed URLS.

2016-02-23 Thread Lewis John Mcgibbney
Hi Harsh On Tue, Feb 23, 2016 at 12:38 AM, wrote: > > From: harsh > To: user@nutch.apache.org > Cc: > Date: Tue, 23 Feb 2016 11:22:06 +0530 > Subject: Inject command re-inject seed URLS. > Hi All > When give INJECT command to inject

Re: fetch deletes all metadata except _csh_ and _rs_

2016-02-23 Thread Lewis John Mcgibbney
Hi Adnane, Yes, we were getting your mail. Just too busy to respond so thank you for your patience. OK so this sounds like a bug IMHO. No Metadata should be deleted, at the most updates should occur, that is all. Can you please log an issue at the Nutch Jira instance describing your Nutch 2.X

Re: Nutch 2.x integration with SOLR

2016-02-23 Thread Lewis John Mcgibbney
Hi Tom, On Wed, Feb 17, 2016 at 10:34 AM, wrote: > > From: Tom Running > To: user@nutch.apache.org > Cc: > Date: Tue, 16 Feb 2016 16:34:59 -0500 > Subject: Nutch 2.x integration with SOLR > Any one able to get Nutch 2.X working with

Re: Error fetching with nutch2.3.1 & cassandra: supercolumn parameter is not optional for super CF sc

2016-02-23 Thread Lewis John Mcgibbney
Hi Michael, On Wed, Feb 17, 2016 at 10:34 AM, wrote: > Subject: Error fetching with nutch2.3.1 & cassandra: supercolumn parameter > is not optional for super CF sc > Hello List, > > i'm new to nutch and i'm trying to run nutch with cassandra. > > [snip] >

Re: [CIS-CMMI-3] ScannerTimeoutException: 157036ms passed since the last invocation, timeout is currently set to 60000

2016-02-23 Thread Lewis John Mcgibbney
Hi Kshitij, On Mon, Feb 15, 2016 at 1:02 AM, wrote: > From: Kshitij Shukla > To: Nutch User , Hbase User > Cc: > Date: Mon, 15 Feb 2016 14:31:39 +0530 > Subject: [CIS-CMMI-3]

Re: Nutch/Tika failed to parse text/html content

2016-02-23 Thread Lewis John Mcgibbney
Hi Arthur, On Mon, Feb 15, 2016 at 1:02 AM, wrote: > From: Arthur Yarwood > To: "user@nutch.apache.org" > Cc: > Date: Sun, 14 Feb 2016 22:08:13 + > Subject: Nutch/Tika failed to parse text/html content > I'm

Re: Extracting title description and keywords from a fetched URL

2016-02-23 Thread Lewis John Mcgibbney
Hi Gideon, On Sun, Feb 14, 2016 at 1:50 AM, wrote: > Subject: Extracting title description and keywords from a fetched URL > Hi everyone, > > I'm trying to crawl several websites and extract only their title, keyword > and description (and nothing else) > I

Re: runtime exception during nutch generate

2016-02-23 Thread Lewis John Mcgibbney
Hi Binoy, Apologies for rather late reply. Please see my responses inline On Sun, Feb 14, 2016 at 1:50 AM, wrote: > From: Binoy Dalal > To: user@nutch.apache.org > Cc: > Date: Sat, 13 Feb 2016 16:01:37 + > Subject: runtime

Re: no respond after inject

2016-02-10 Thread Lewis John Mcgibbney
Hi Dan, On Wed, Feb 10, 2016 at 5:06 AM, wrote: > > Sorry for another beginner question. After I installed nutch-2.3.1. > hbase-0.98.9 and elasticsearch-2.1.0 > I start to test to crawl one website with 'nutch inject urls.txt' > > The terminal window shows

Re: Solr 4.7 Index Replication not working

2016-02-10 Thread Lewis John Mcgibbney
Hi Jackie, Please take your query over to the solr-u...@lucene.apache.org list. You will get better help there. Thanks On Wed, Feb 10, 2016 at 5:06 AM, wrote: > > From: "Richardson, Jacquelyn F." > To: "user@nutch.apache.org"

Fwd: private Digest 5 Feb 2016 18:05:43 -0000 Issue 354

2016-02-05 Thread Lewis John Mcgibbney
h 1271) [REMINDER] ApacheCon NA 2016 Travel Assistance Applications now open! 1271 by: lewis john mcgibbney Administrivia: - To post to the list, e-mail: priv...@nutch.apache.org To unsubscribe, e-mail: private-digest-un

Re: [CIS-CMMI-3] HBASE_CLIENT_PREFETCH_LIMIT

2016-02-03 Thread Lewis John Mcgibbney
Hi Kshitij, On Wed, Feb 3, 2016 at 7:48 AM, wrote: > I am having this error when I tried to run ./crawl command. > I am using following software stack : > > *apache-nutch-2.3.1** > **hadoop-2.6.2** > **hbase-1.1.3* > Please note the suggested software stack

Re: configuration nutch with hbase and elasticserach

2016-02-03 Thread Lewis John Mcgibbney
Hi Dan, On Wed, Feb 3, 2016 at 7:48 AM, wrote: > > Thanks, it helped. > Hadoop 2.5.2 and Hbase-0.98.8-hadoop2 are both installed in standalone > mode. And I can run them separately. > However, when I run nutch inject, I ran into the problem like > Caused by:

Re: configuration nutch with hbase and elasticserach

2016-02-02 Thread Lewis John Mcgibbney
Hi Dan, I would advise you to use a JDK 1.7. Also, if you are just starting out then please follow this tutorial http://wiki.apache.org/nutch/Nutch2Tutorial If you have any issues then please let us know. Thanks On Fri, Jan 29, 2016 at 10:20 PM, wrote: > Hi

Re: Error running nutch on Hortonworks HDP

2016-02-02 Thread Lewis John Mcgibbney
Hi Xtroce, On Fri, Jan 29, 2016 at 10:20 PM, wrote: > I just started using nutch, and after spending yesterday to figure out > how to run nutch on the newest HDP (2.3.2) vm i ran into some > problems. > > Building the source directly, went fine, > but after

Re: Filter Urls Only At Generation Time Or Fetch Time

2016-02-02 Thread Lewis John Mcgibbney
Hi Manish, On Fri, Jan 29, 2016 at 10:20 PM, wrote: > I am using Nutch 1.10 and we are planing to crawl just some url which > match some pattern. > The problem is we can not do it using regex-urlfilter.txt as this way the > seeds itself would be rejected. > >

Re: configuration nutch with hbase and elasticserach

2016-01-27 Thread Lewis John Mcgibbney
Hi Dan, Which version of Nutch 2.X are you using? The document you've highlighted below stated Nutch 2.3 with gora-hbase 0.5. Both of these are old and I would strongly advise you to use Nutch 2.3.1 (just released last week) along with one of the following backends The recommended Gora backends

[ANNOUNCE] Apache Nutch 2.3.1 Release

2016-01-21 Thread lewis john mcgibbney
Hi Folks, !!Apologies for cross posting!! The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v2.3.1. We advise all current users and developers of the 2.X series to upgrade to this release. Nutch is a well matured, production ready Web crawler. Nutch 2.X branch

[RESULT] WAS Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-21 Thread Lewis John Mcgibbney
Hi Folks, I am bringing this VOTE to a close with the following results [3] +1 Release this package as Apache Nutch 2.3.1. Lewis John McGibbney* Sebastian Nagel* Chris Mattmann* [0] -1 Do not release this package because… *Nutch PMC Member I am really happy to therefore announce that the VOTE

Re: user Digest 16 Jan 2016 13:19:55 -0000 Issue 2520

2016-01-16 Thread Lewis John Mcgibbney
Hi Manish, On Sat, Jan 16, 2016 at 5:19 AM, wrote: > I was Checking the Nutch logs,I observed there are more fetching logs then > parsed logs. > I understand parsing does not happen for urls with fetch fail but the > difference is so high, any Idea ? How did

Re: [VOTE] Release Apache Nutch 2.3.1rc2

2016-01-13 Thread Lewis John Mcgibbney
Any others able to review please? On Sun, Jan 10, 2016 at 7:01 AM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi Folks, > > A second candidate for the Nutch 2.3.1 release is available at: > > https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/ >

Re: Custom Generator or ScoringFilter (or Fetch)

2016-01-13 Thread Lewis John Mcgibbney
Hi Lex, On Wed, Jan 13, 2016 at 2:49 PM, wrote: > Thanks for the response Lewis. > np > > I'll give nucth 2.3.1 a spin later tonight. > Nice > > I didn't have success with batchId. I thought I could overwrite this in the > DB with 123 and then ./fetch

Re: Custom Generator or ScoringFilter (or Fetch)

2016-01-12 Thread Lewis John Mcgibbney
Hi Lex, On Mon, Jan 11, 2016 at 2:16 PM, wrote: > > I'm using Nutch 2.3. > Please note we are on the very cusp of releasing Apache Nutch 2.3.1 which has a number of bug fixes and improvements. There is a VOTE out right now for it. If you have time please

Re: How To Debug Fetch Phase IN Nutch 1.10

2016-01-10 Thread Lewis John Mcgibbney
Hi Manish, On Sat, Jan 9, 2016 at 1:05 AM, wrote: > Hi, I saw below article for debugging nutch in eclipse but looks like it > just debug parse phase and skip Fetching phase > $NUTCH_HOME/bin/nutch parsechecker http://myurl.com/ > This is not strictly true.

[VOTE] Release Apache Nutch 2.3.1rc2

2016-01-10 Thread Lewis John Mcgibbney
Hi Folks, A second candidate for the Nutch 2.3.1 release is available at: https://dist.apache.org/repos/dist/dev/nutch/2.3.1rc2/ The release candidate is a zip and tar.gz sources archive of the sources in: http://svn.apache.org/repos/asf/nutch/tags/release-2.3.1rc2/ In addition, a staged

Re: Custom Generator or ScoringFilter (or Fetch)

2016-01-10 Thread Lewis John Mcgibbney
Hi Lex, Which version of Nutch are you using? On Sat, Jan 9, 2016 at 1:05 AM, wrote: > > I've been curious this year to delve further into Nutch. I have been using > generate/fetch/parse/update but noticed some pages get re-crawled before > fetching new

Re: nutch 2.x nutchserver problem

2016-01-04 Thread Lewis John Mcgibbney
Hi Paul, I cannot reproduce this initialization error... I show this as follows using 2.X from SVN lmcgibbn@LMC-032857 /usr/local/2hbase(joshua) $ svn status lmcgibbn@LMC-032857 /usr/local/2hbase(joshua) $ svn update Updating '.': At revision 1722842. lmcgibbn@LMC-032857

Re: Choosing Amazon Instance type large vs small for large scale crawling

2015-12-29 Thread Lewis John Mcgibbney
Hi Ameer, On Sun, Dec 20, 2015 at 6:09 PM, wrote: > > With this configuration, i am able to crawl 500k url every 4 hours or so. > Sounds like reasonable throughput. You should be able to improve this however. > When i monitor the time for each phase, the

Re: URLS Which Has Redirection Also Getting Indexed

2015-12-29 Thread Lewis John Mcgibbney
Hi Manish, On Fri, Dec 25, 2015 at 1:43 PM, wrote: > > > Let me explain with example. > > Let’s say we have URL A and it is getting redirected to URL B , I see both > A and B getting indexed, I don’t want to index A when it’s redirecting to > another URL. > >

Re: Nutch Crawls More From Seed Then The Discovered Links

2015-12-29 Thread Lewis John Mcgibbney
Hi Manish, Evidence of this from the crawldb statistics would be helpful. I have not noticed behavior of this nature however it may also have to do with a robots.txt issue or other restriction. Can you provide some crawldb statistics please? Thanks Lewis On Wed, Dec 23, 2015 at 7:11 AM,

Re: Deploy a Nutch crawler or use Webhose.io?

2015-12-14 Thread Lewis John Mcgibbney
Hi Jon, On Mon, Dec 14, 2015 at 10:22 AM, wrote: > > I need to harvest blog posts and news articles and extract their date, the > author, the text, the title and the comments if possible. The way I see it > I have two choices, deploy a Nutch crawler or as a

Re: Index Page Locale

2015-12-14 Thread Lewis John Mcgibbney
Hi Manish, On Sat, Dec 12, 2015 at 6:22 AM, wrote: > > Ian using notch 1.10, I need to index page locale, I could see there is > plugin available for identifying page language but I need to index locale. > > Well I have a few answers. 1) Take a look at the

Fwd: ApacheCon NA 2015 Travel Assistance Applications now open!

2015-12-07 Thread Lewis John Mcgibbney
ns now open! 1251 by: lewis john mcgibbney Administrivia: - To post to the list, e-mail: priv...@nutch.apache.org To unsubscribe, e-mail: private-digest-unsubscr...@nutch.apache.org For additional commands, e-mail: private-di

Re: Chosing AWS instance for Nutch 1.X

2015-12-07 Thread Lewis John Mcgibbney
Hi Tien, Please see answers inline On Sat, Dec 5, 2015 at 9:34 AM, wrote: > > I'm setting the AWS cluster for Nutch 1.10 to crawl about 100M+ pages from > www. > OK, if I were you I would upgrade Nutch to 1.11... which has literally just been released. > >

[RESULT] WAS Re: [VOTE] Release Apache Nutch 1.11 RC#2

2015-12-07 Thread Lewis John Mcgibbney
Hi user@ dev@, 72hrs has lapsed so I would like to bring this thread to a close! VOTE's were cast with the following RESULT [7] +1 Release this package as Apache Nutch 1.11 Lewis John Mcgibbney* Roannel Fernández Hernández Sujen Shah* Chris A Mattmann* Julien Nioche* Sebastian Nagel* Jorge Luis

[RELEASE] Apache Nutch 1.11

2015-12-07 Thread lewis john mcgibbney
Hello Folks, 07 December 2015 - Nutch 1.11 Release The Apache Nutch PMC are pleased to announce the immediate release of Apache Nutch v1.11. We advise all current users and developers of the 1.X series to upgrade to this release. What is Apache Nutch? Nutch is a well matured, production ready

[VOTE] Release Apache Nutch 1.11 RC#2

2015-12-04 Thread Lewis John Mcgibbney
-1.11-rc2/ All artifacts have been signed with the following signature as present within KEYS 48BAEBF6 2013-10-28 Lewis John McGibbney (CODE SIGNING KEY) < lewi...@apache.org> In addition, a staged maven repository is available here: https://repository.apache.org/content/repositories/orgapach

Re: How to use nutch 2.2.1 to crawl images

2015-12-03 Thread Lewis John Mcgibbney
Hi Byzen.Ma, I would advise you to follow the tutorial at http://wiki.apache.org/nutch/Nutch2Tutorial Please see the answers inline On Thu, Dec 3, 2015 at 6:26 AM, wrote: > > > storage.data.store.class > org.apache.gora.sql.store.SqlStore

Re: Nutch+Hbase on EMR CLASSPATH issue

2015-11-19 Thread Lewis John Mcgibbney
Hi Ketan, On Wed, Nov 18, 2015 at 2:00 AM, wrote: > > Nutch+Hbase on EMR CLASSPATH issue

Re: Score in SOLR Index allways 0.0

2015-11-10 Thread Lewis John Mcgibbney
Hi Martin, On Sun, Nov 8, 2015 at 1:28 PM, wrote: > > I am runing nutch 1.9 with solr 3.6.2 > > When I build an index I get allways 0.0 score > > In debug I see that the value of fieldNorm i set to 0.0. > > I tried out omitNorms for the field in Schema.xml no

Re: Populating outlinks with CrawlDatum Metadata

2015-11-05 Thread Lewis John Mcgibbney
Hi Julien, Yep I sure can :) Thank you. This must have been stored away somewhere in the back of my mind. Thanks Julien. Lewis On Thu, Nov 5, 2015 at 7:25 PM, wrote: > > Hi Lewis > > Can't you achieve this already with the url-meta plugin and urlmeta.tags >

Populating outlinks with CrawlDatum Metadata

2015-11-03 Thread Lewis John Mcgibbney
Hi Folks, The above has been discussed a few times with the following thread [0] being probably most helpful. Is anyone else working with another mechanism for populating CrawlDatum Metadata to Outlinks? I thought we (Julien) had implemented this feature and it was a case of turning it on within

[RESULT] WAS Re: [VOTE] Release Apache Nutch 2.3.1

2015-10-29 Thread Lewis John Mcgibbney
(please state why) Lewis John Mcgibbney* (this release candidate has a flaw in the crawl script) Sherban Drulea (cannot get it to run... this release candidate has a flaw in the crawl script) Sebastian Nagel* (this release candidate has a flaw in the crawl script) *PMC Binding This VOTE therefore fails

Re: [VOTE] Release Apache Nutch 2.3.1

2015-10-29 Thread Lewis John Mcgibbney
this is good to go. I will send a RESULT thread then work on getting 2.3.1 RC #2 shipped. Thanks On Tue, Sep 22, 2015 at 6:45 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: > Hi user@ & dev@,This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1. > > We

Re: Apache Nutch Python-Nutchpy

2015-10-02 Thread Lewis John Mcgibbney
Hi Sanjay, On Fri, Oct 2, 2015 at 4:33 PM, wrote: > > I want to use the apache nutch python nutchpy library for analyzing the > crawl data generated from apache nutch. > Can anyone please point me to the documentation for nutchpy library that > how I can

Re: Apache Nutch Output structure

2015-10-02 Thread Lewis John Mcgibbney
Hi Folks, On Fri, Oct 2, 2015 at 4:33 PM, wrote: > > I already went through the page but it gives only technical information > about the directories but no information related to relation amongst these > folders and what they really mean in terms of crawled

Running JProfiler on Selenium Grid to Prevent System Overload

2015-10-01 Thread Lewis John Mcgibbney
Hi Folks, We've noticed that when running Selenium Grid with Nutch we can easily kill an Amazon cluster. Amazon basically mention that load can get too high on the frontend with "...too many java, firefox, and screen processes" which stopped the EMR healthchecks from properly working. This caused

Re: Unable to use nutch 2.3 crawl script for MySQL, Mongo, or Cassandra

2015-09-30 Thread Lewis John Mcgibbney
Hi Sherban, On Wed, Sep 30, 2015 at 6:46 AM, wrote: > > I tried with SOLR 4.9.1. > OK. As I said Solr 4.6 is supported but never mind. > > I copied /release-2.3.1/runtime/local/conf/schema.xml to > solr-4.9.1/example/solr/collection1/conf/schema.xml >

Re: [VOTE] Release Apache Nutch 2.3.1

2015-09-30 Thread Lewis John Mcgibbney
Hi Folks, Is anyone else able to test and run the release candidate for 2.3.1? It would be great to get a release if we can get the VOTE's and the RC is suitable. Thanks in advance. Best Lewis On Wed, Sep 23, 2015 at 9:46 PM, Lewis John Mcgibbney < lewis.mcgibb...@gmail.com> wrote: >

Re: Unable to use nutch 2.3 crawl script for MySQL, Mongo, or Cassandra

2015-09-30 Thread Lewis John Mcgibbney
Hi Sherban, On Wed, Sep 30, 2015 at 5:41 PM, wrote: > > OK. I'm using SOLR 4.6.0. > OK > > Caused by: org.apache.solr.common.SolrException: copyField source > :'rawcontent' is not a glob and doesn't match any explicit field or > dynamicField.. Schema file

Re: Unable to use notch 2.3 crawl script for MySQL, Mongo, or Cassandra

2015-09-29 Thread Lewis John Mcgibbney
Hi Sherban, On Mon, Sep 28, 2015 at 10:54 PM, wrote: > > I made progress. I downloaded and installed the release candidate from > https://svn.apache.org/repos/asf/nutch/tags/release-2.3.1 > OK great. > > > > plugin.includes > >

Re: Unable to use notch 2.3 crawl script for MySQL, Mongo, or Cassandra

2015-09-27 Thread Lewis John Mcgibbney
Hi Drulea, On Sun, Sep 27, 2015 at 7:36 AM, wrote: > > I’m using nutch 2.3 on OS X 10.9.5 with homebrew. > From the start I would like to point you at the current release candidate for Nutch 2.3.1. The VOTE is currently open and the release candidate is

Re: Webcast : Apache Nutch on EMR

2015-09-25 Thread Lewis John Mcgibbney
Hi Julien, On Fri, Sep 25, 2015 at 5:12 AM, wrote: > > Hi again, > > I have uploaded at webcast explaining how to run Nutch on AWS Elastic Map > Reduce > > https://www.youtube.com/watch?v=v9zjcTjjjyU > > Please excuse the sound quality, hesitations and

Nutch File Formats

2015-09-24 Thread Lewis John Mcgibbney
Hi user@, Put a bit of time into updating and further describing our core data structures on the following page. https://wiki.apache.org/nutch/NutchFileFormats Kudos to previous Nutch developers who have produced these data

[VOTE] Release Apache Nutch 2.3.1

2015-09-22 Thread Lewis John Mcgibbney
Hi user@ & dev@, This thread is a VOTE for releasing Apache Nutch 2.3.1 RC#1. We addressed 32 issues in all which can be seen at the release report http://s.apache.org/nutch_2.3.1 The release candidate comprises the following components. * A staging repository [0] containing various Maven

NUTCH-1946 Upgrade to Gora 0.6.1

2015-09-17 Thread Lewis John Mcgibbney
Hi user@ and dev@, Quick message to ask kindly for a call to arms. I pushed a patch to NUTCH-1946 [0] for Nutch 2.X HEAD [1] This includes - Upgrade to Gora 0.6.1 - Upgrade to Hadoop 2.5.1 (which Gora supports fully) see NUTCH-2101

Re: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah

2015-09-16 Thread Lewis John Mcgibbney
Nice Sujen. On Wed, Sep 16, 2015 at 7:08 AM, wrote: > > > Subject: [ANNOUNCE] New Nutch committer and PMC - Sujen Shah > Dear all, > > on behalf of the Nutch PMC it is my pleasure to announce > that Sujen Shah has been voted in as committer and member > of

[ANNOUNCE] Apache Gora 0.6.1 Release

2015-09-15 Thread lewis john mcgibbney
Hi All, The Apache Gora team are pleased to announce the immediate availability of Apache Gora 0.6.1. What is Gora? Gora is a framework which provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key value stores, document stores and RDBMSs,

Re: [ANNOUNCE] New Nutch committer and PMC - Asitang Mishra

2015-09-10 Thread Lewis John Mcgibbney
Nice one Asitang :) On Thu, Sep 10, 2015 at 9:12 AM, wrote: > > > Dear all, > > on behalf of the Nutch PMC it is my pleasure to announce > that Asitang Mishra has joined the Nutch team as committer > and PMC member. Asitang, please feel free to introduce >

Re: Problems indexing to solr 3.5 from nutch 1.8

2015-09-05 Thread Lewis John Mcgibbney
Hi Guy, The schema is present in the conf directory as shown here https://github.com/apache/nutch/blob/trunk/conf/schema.xml Lewis On Thu, Sep 3, 2015 at 11:13 AM, wrote: > > Subject: Re: Problems indexing to solr 3.5 from nutch 1.8 > Having a similar
