Nutch generating fewer URLs for fetcher to fetch (running in Hadoop mode)

2016-04-13 Thread Karanjeet Singh
Hello,

I am trying to crawl a website using Nutch on a Hadoop cluster. I have
modified the crawl script to restrict sizeFetchList to 1000 (which is
the topN value for the nutch generate command).

However, Nutch generates only 62 URLs even though the unfetched URL
count is approximately 5,000. I am using the command below:

nutch generate \
  -D mapreduce.job.reduces=16 \
  -D mapreduce.job.maps=8 \
  -D mapred.child.java.opts=-Xmx8192m \
  -D mapreduce.map.memory.mb=8192 \
  -D mapreduce.reduce.speculative=false \
  -D mapreduce.map.speculative=false \
  -D mapreduce.map.output.compress=true \
  crawl/crawldb crawl/segments -topN 1000 -numFetchers 1 -noFilter

Can anyone please look into this and let me know if I am missing something?
Please find the crawl configuration here [0].

[0]: https://github.com/karanjeets/crawl-evaluation/tree/master/nutch/conf

Thanks & Regards,
Karanjeet Singh
USC


nutch-selenium

2016-04-13 Thread Teena Antony
Hello,

I am using Nutch 2.3.1. I tried to install protocol-selenium by following
https://github.com/apache/nutch/tree/trunk/src/plugin/protocol-selenium
When I run the command

sudo /usr/bin/Xvfb :11 -screen 0 1024x768x24 &

it stays stuck at 'Initializing built-in extension GLX' (for 8+ hours).



So I moved on and completed the rest of the tutorial. When I try to crawl, I
get this error:

java.lang.RuntimeException: org.openqa.selenium.remote.UnreachableBrowserException:
Could not start a new session. Possible causes are invalid address of the
remote server or browser start-up failure.



When I try xvfb-run firefox http://google.com I get this
error:

Xlib:  extension "RANDR" missing on display ":99".

Xlib:  extension "RANDR" missing on display ":99".

Xlib:  extension "RANDR" missing on display ":99".

libGL error: No matching fbConfigs or visuals found

libGL error: failed to load driver: swrast



Please help.

Teena



Re: Adding a new field to Nutch + MongoDB datastore using plugin

2016-04-13 Thread Lewis John Mcgibbney
Hi jvence,
Please see my reply below.

On Wed, Apr 13, 2016 at 8:26 AM,  wrote:

>
> From: jvence 
> To: user@nutch.apache.org
> Cc:
> Date: Tue, 12 Apr 2016 10:17:20 -0700 (MST)
> Subject: Adding a new field to Nutch + MongoDB datastore using plugin
> I am running Nutch 2.3.1 configured with MongoDB (using Gora) +
> Elasticsearch
> and would like to add a new field to the storage database NOT the index.
>

Cool. Please see below.


>
> I am able to add a field to the elasticsearch index using a custom plugin
> but would like to add it to the mongodb record for each website.
>
> I've added the field to the ./conf/schema.xml file and to
>

This relates to Solr only. If you have indexer-solr included in
plugin.includes then your field will be added to the index. This has
nothing to do with the Gora DataStore, however.


> ./conf/gora-mongodb-mapping.xml - The field does appear in the index but
> not in the mongo record.
>

In addition to augmenting the mapping file, you need to
augment the webpage.avsc [0], as this essentially defines the data model you
wish to persist into Gora. We call this the persistent class. Add
your data structure (in accordance with the Avro Specification [1]), then
run the following from $NUTCH_HOME and you will be good to go.

ant generate-gora-src

Any issues, please let us know.
Thanks

[0] https://github.com/apache/nutch/blob/2.x/src/gora/webpage.avsc
[1] https://avro.apache.org/docs/current/spec.html


Re: [CIS-CMMI-3] Re: [CIS-CMMI-3] Enabling/configuring Nutch logging?

2016-04-13 Thread Lewis John Mcgibbney
Hi Kshitij,

On Wed, Apr 13, 2016 at 5:36 AM, Kshitij Shukla 
wrote:

> Thanks for your reply Lewis,
>
> Regarding your points:
> 1) I am already using the parameterized messaging convention.
>

From your line of Java code... you were not. You posted the following:

LOG.debug("Found keys :" + lcMetatag + "\t" + value);

Parameterized message notation would be as follows:

LOG.debug("Found keys : {} \t {}", lcMetatag, value);
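
To illustrate why the parameterized form matters, here is a minimal, self-contained sketch. It does not use SLF4J itself: TinyLogger and CountingValue are hypothetical stand-ins invented for this example, and TinyLogger only mimics the {} substitution. What it tries to show is that string concatenation builds the message even when DEBUG is off, while the parameterized call defers formatting until the level check has passed.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ParameterizedLoggingSketch {

    /** Hypothetical value type that counts how often toString() is invoked. */
    static final class CountingValue {
        final AtomicInteger toStringCalls = new AtomicInteger();
        private final String text;

        CountingValue(String text) { this.text = text; }

        @Override
        public String toString() {
            toStringCalls.incrementAndGet();
            return text;
        }
    }

    /** Replaces each "{}" in the pattern with the next argument, left to right. */
    public static String format(String pattern, Object... args) {
        StringBuilder sb = new StringBuilder();
        int from = 0;
        for (Object arg : args) {
            int at = pattern.indexOf("{}", from);
            if (at < 0) break;                       // more arguments than placeholders
            sb.append(pattern, from, at).append(arg); // append(Object) calls toString()
            from = at + 2;
        }
        return sb.append(pattern.substring(from)).toString();
    }

    /** Hypothetical stand-in for a logger; NOT the SLF4J API. */
    static final class TinyLogger {
        private final boolean debugEnabled;

        TinyLogger(boolean debugEnabled) { this.debugEnabled = debugEnabled; }

        void debug(String message) {
            if (debugEnabled) System.out.println(message);
        }

        void debug(String pattern, Object... args) {
            if (!debugEnabled) return;               // arguments are never stringified
            System.out.println(format(pattern, args));
        }
    }

    /** Concatenation: the message is built before debug() is even called. */
    public static int concatCalls(boolean debugEnabled) {
        TinyLogger log = new TinyLogger(debugEnabled);
        CountingValue lcMetatag = new CountingValue("keywords");
        CountingValue value = new CountingValue("nutch,crawler");
        log.debug("Found keys :" + lcMetatag + "\t" + value);
        return lcMetatag.toStringCalls.get() + value.toStringCalls.get();
    }

    /** Parameterized: formatting happens only if the level check passes. */
    public static int parameterizedCalls(boolean debugEnabled) {
        TinyLogger log = new TinyLogger(debugEnabled);
        CountingValue lcMetatag = new CountingValue("keywords");
        CountingValue value = new CountingValue("nutch,crawler");
        log.debug("Found keys : {} \t {}", lcMetatag, value);
        return lcMetatag.toStringCalls.get() + value.toStringCalls.get();
    }

    public static void main(String[] args) {
        System.out.println("concatenation, DEBUG off: " + concatCalls(false));        // 2
        System.out.println("parameterized, DEBUG off: " + parameterizedCalls(false)); // 0
        System.out.println("parameterized, DEBUG on:  " + parameterizedCalls(true));  // 2
    }
}
```

With DEBUG disabled, the concatenated call still pays for two toString() invocations, while the parameterized call pays for none. Neither form makes a message appear if the effective log level for the class is above DEBUG.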


> 2) I tried adding the class explicitly to log4j.properties.
>

So can you paste a snippet of your log4j.properties?


> 3) I have rebuilt Nutch, followed by a cluster restart.
>

There is no need for a cluster restart. All you need to do is deploy the
new .job file and you should be good to go. It is always a good idea
to make sure that things work locally prior to deployment on a
cluster. You may be able to easily test and debug your code by using the
parsechecker tool.


>
> Still I am unable to see the output of the line below anywhere in the
> console or logs!
> LOG.debug("Found keys");
>

You are not looking for this line of logging though, right? You are
looking for LOG.debug("Found keys : {} \t {}", lcMetatag, value);
Lewis


RE: HTTPS Problem even using httpclient

2016-04-13 Thread Markus Jelsma
Hello - maybe there is a firewall, or there was a temporary network issue? We
have no trouble fetching that site with Nutch.
Markus
 
 
-Original message-
> From:Bin Wang 
> Sent: Tuesday 12th April 2016 21:41
> To: Apache.Nutch.User 
> Subject: HTTPS Problem even using httpclient
> 
> Hi there,
> 
> I am testing Nutch against a blog: https://datafireball.com/
> 
> I added the link to the seed.txt and left the regex-urlfilter the way it
> is. I replaced protocol-http with protocol-httpclient and thought that would
> make it capable of fetching https links. However, it failed with the
> following error after I executed the crawl command:
> 
> $ bin/crawl urls/ crawldir 3
> 
> fetcher.maxNum.threads can't be < than 50 : using 50 instead
> robots.txt whitelist not configured.
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=1
> fetch of https://datafireball.com/ failed with:
> org.apache.commons.httpclient.NoHttpResponseException: The server
> datafireball.com failed to respond
> Thread FetcherThread has no more work available
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0,
> fetchQueues.getQueueCount=0
> -activeThreads=0
> 
> I am pretty positive that the blog was functioning really well but couldn't
> really get that much help from the internet.
> 
> Can anyone give me some guidance?
> 
> Below is the nutch-site.xml that I was using.
> 
> Best regards,
> 
> Bin
> 
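> <configuration>
>
>   <property>
>     <name>http.agent.name</name>
>     <value>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.112 Safari/537.36</value>
>   </property>
>
>   <property>
>     <name>db.ignore.internal.links</name>
>     <value>false</value>
>   </property>
>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-httpclient|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>   </property>
>
>   <property>
>     <name>http.content.limit</name>
>     <value>-1</value>
>   </property>
>
>   <property>
>     <name>fetcher.server.delay</name>
>     <value>0</value>
>   </property>
>
>   <property>
>     <name>http.redirect.max</name>
>     <value>5</value>
>   </property>
>
>   <property>
>     <name>db.max.anchor.length</name>
>     <value>1000</value>
>   </property>
>
> </configuration>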


[CIS-CMMI-3] Re: [CIS-CMMI-3] Enabling/configuring Nutch logging?

2016-04-13 Thread Kshitij Shukla

Thanks for your reply Lewis,

Regarding your points:
1) I am already using the parameterized messaging convention.
2) I tried adding the class explicitly to log4j.properties.
3) I have rebuilt Nutch, followed by a cluster restart.

Still I am unable to see the output of the line below anywhere in the
console or logs!

LOG.debug("Found keys");

BR.

On Tuesday 12 April 2016 01:42 AM, Lewis John Mcgibbney wrote:

> Hi Kshitij,
>
> On Mon, Apr 11, 2016 at 8:12 AM,  wrote:
>
>> I am working on developing a plugin for Nutch. I have added some code to
>> see the output either in console or in logs like this:
>>
>> LOG.debug("Found keys :" + lcMetatag + "\t" + value);
>
> I would advise you to use the parameterized messaging convention for all of
> your logging
> http://www.slf4j.org/faq.html#logging_performance
>
>> I have also tried adding these 2 lines in log4j.properties (& recompiling
>> Nutch):
>>
>> # RootLogger - DailyRollingFileAppender
>> log4j.rootLogger=DEBUG,DRFA
>>
>> log4j.logger.org.apache.nutch=DEBUG
>
> I would suggest that you explicitly add your class to the log4j.properties
> file; examples of how to do this can be found below
> https://github.com/apache/nutch/blob/trunk/conf/log4j.properties#L26-L61
>
>> But I cannot find the output either in the console or in the Hadoop logs.
>
> Please remember to rebuild your codebase after you change anything; this
> will ensure that new files are packaged into the .job file for submission
> to YARN.



--

Please let me know if you have any questions , concerns or updates.
Have a great day ahead :)

Thanks and Regards,

Kshitij Shukla
Software developer
