Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Julien Nioche
Hi Lewis, A few comments below. I use Nutch 2.x as it enables me to do analytics over the data I am crawling. This is my justification for trying to maintain and further the development on that branch over the last while. Just out of interest, what sort of analytics do you do and why is it

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
Hi Meraj, The generator will place all the URLs in a single segment if they all belong to the same host, for politeness reasons. Otherwise it will use whichever value is passed with the -numFetchers parameter in the generation step. Why don't you use the crawl script in /bin instead of tinkering
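
A rough illustration of the behaviour described above, not Nutch's actual partitioning code: if URLs are assigned to fetch partitions by hashing their host name, every URL from one host lands in the same partition however many fetchers are requested. The function name `partition_by_host` and the use of CRC32 as the hash are assumptions made for this sketch only.

```python
from collections import defaultdict
from urllib.parse import urlsplit
import zlib


def partition_by_host(urls, num_fetchers):
    """Group URLs into fetch partitions keyed on a hash of the host name.

    Hypothetical helper for illustration; it only mimics the idea that
    one host is handled by one fetcher so that politeness delays hold.
    """
    partitions = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        # CRC32 is stable across runs (the built-in hash() is salted).
        index = zlib.crc32(host.encode("utf-8")) % num_fetchers
        partitions[index].append(url)
    return dict(partitions)


single_host = [
    "http://example.com/a",
    "http://example.com/b",
    "http://example.com/c",
]
# All three URLs share a host, so only one partition is ever used.
print(len(partition_by_host(single_host, 4)))
```

With several distinct hosts the URLs can spread over up to `num_fetchers` partitions, which is the case where the value passed to the generation step starts to matter.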

How do I pass custom URL filter URL configuration to filter plugins?

2014-08-29 Thread Krishnanand, Kartik
Hi, Nutch Gurus, I have a use case that I need to implement and I hope that someone can help. I have a situation where I need to generate and build URLs dynamically and pass them to the respective filter. I want to pass a newly constructed string to the Filter implementation associated with

Nutch Confusion

2014-08-29 Thread Iqbal Shaikh
Hi All, Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1 version. We would be indexing our crawled data in ElasticSearch 1.x version. I know the 2.2.1 version provides OTB support for Elastic 0.x version but to use 2.x I need to change the code (ElasticWriter.java) This

Re: Nutch Confusion

2014-08-29 Thread Julien Nioche
Hi Iqbal, Am doing a POC to help decide if we should be using Nutch 1.9 or 2.2.1 version. We would be indexing our crawled data in ElasticSearch 1.x version. I know the 2.2.1 version provides OTB support for Elastic 0.x version but to use 2.x I need to change the code (ElasticWriter.java)

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Meraj A. Khan
Hi Julien, I have 15 domains and they are all being fetched in a single map task which does not fetch all the URLs no matter what depth or topN I give. I am submitting the Nutch job jar which seems to be using the Crawl.java class. How do I use the crawl script on a Hadoop cluster, are there any

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script. Just go to runtime/deploy/bin and run the script from there. Julien On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote: Hi Julien, I have 15 domains and they are all being fetched in a single map task

RE: Nutch Confusion

2014-08-29 Thread Iqbal Shaikh
Thanks Julien for the prompt response. Actually, since the model for the 1.9 version is all plugin-based, I shouldn't be expecting an ivy.xml like in 2.x to have an Elastic config. So ignore that comment. Yes, I mean HDFS (new to big data and Hadoop). Isn't HBase the default one for 1.9 too?

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread S.L
Thanks, can this be used on a hadoop cluster? Sent from my HTC - Reply message - From: Julien Nioche lists.digitalpeb...@gmail.com To: user@nutch.apache.org user@nutch.apache.org Subject: Nutch 1.7 fetch happening in a single map task. Date: Fri, Aug 29, 2014 9:00 AM See

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread Julien Nioche
As the name runtime/deploy suggests, it is used exactly for that purpose ;-) Just make sure HADOOP_HOME/bin is added to the path and run the script, that's all. Look at the bottom of the nutch script for details. Julien PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (

Re: Nutch Confusion

2014-08-29 Thread Ali Nazemian
Dear Iqbal, Hi, As far as I know, if you don't need the Gora mapper for using Nutch over HBase, MySQL, etc., it is better to use version 1.x, since some Nutch functionality is not implemented in version 2.x and Nutch 1.x provides better performance for crawling web pages. ES is not difficult

Re: Nutch 1.7 fetch happening in a single map task.

2014-08-29 Thread S.L
Sorry Julien, I overlooked the directory names. My understanding is that the Hadoop job is submitted to a cluster by using the following command on the RM node: bin/hadoop .job file params. Are you suggesting I submit the script instead of the Nutch .job jar, like below? bin/hadoop bin/crawl

Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Mattmann, Chris A (3980)
+1, great. I'd like to have a conversation about versioning. Since we're at 1.9, my suggestion would be to have the next in the trunk series (1.x) move to version 3.x post 1.9 for the release. Nutch2 remains Nutch and can be worked on there. That would give us a nice split in the diversionary

Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Bin Wang
I think it is a great idea to build images with an environment correctly set up. I think two types of images would be helpful. 1. Development (VirtualBox) Here, we have Eclipse, plugin, pseudo Hadoop, etc. correctly installed, maybe on an Ubuntu box with 3D acceleration enabled. Then people can

Nutch 2.X Vagrent WAS Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Lewis John Mcgibbney
Hi Nicholas, NOTE: Thread name has changed to reflect diversion on topic. On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote: will you use config management like ansible backing vagrant? Well, thanks for the links here. The GitHub repos they have indicate that they

Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Lewis John Mcgibbney
Hi Julien, On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote: Just out of interest, what sort of analytics do you do and why is it better to do it in 2.x than 1.x? Nowhere did I say it was better or worse than in 1.X. Let me be clear here. I use Nutch 2.X, as I

Re: Nutch 2.X Vagrent WAS Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Nicholas Roberts
Hi Lewis, I have a main development server and connect to it from other machines over LAN and sometimes via WAN. Vagrant by itself has a highly constrained network config: you need the proprietary cloud Vagrant Share, which does SSH tunnels, or you use config management for networking. A vagrant file

Re: [RELEASE] Apache Nutch 1.9

2014-08-29 Thread Guy McDowell
I'm confused as to what the significant differences between 1.x and 2.x are. Is there a bit of history that I could read about why the two came to be developed in parallel? As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which path would be best for me to

Re: New documents still not being added by nutch

2014-08-29 Thread Paul Rogers
Hi Guys, I'm still struggling with this. In summary my directory structure is as follows:

/
|_doccontrol
  |_DC-10 Incoming Correspondence
  |_DC-11 Outgoing Correspondence

If when I first run nutch the folders DC-10 and DC-11 contain all the files to be indexed then nutch crawls everything