Deploy nutch on existing Hadoop cluster

2013-02-21 Thread Amit Sela
Does anyone have a good tutorial on deploying Nutch (1.6) on a pre-existing Hadoop cluster? Thanks.

Re: Deploy nutch on existing Hadoop cluster

2013-02-21 Thread Jorge Luis Betancourt Gonzalez
Perhaps this could help: http://www.rui-yang.com/develop/build-nutch-1-4-cluster-with-hadoop/ - Original message - From: Amit Sela am...@infolinks.com To: user@nutch.apache.org Sent: Thursday, 21 February 2013 5:00:29 Subject: Deploy nutch on existing Hadoop cluster Anyone have a

Re: Deploy nutch on existing Hadoop cluster

2013-02-21 Thread Julien Nioche
https://wiki.apache.org/nutch/NutchHadoopTutorial Basically, follow the steps in http://hadoop.apache.org/docs/stable/cluster_setup.html, then install Nutch on the master node of your cluster, 'cd runtime/deploy/bin', and use the nutch scripts as usual. You can then use the standard MapReduce webapp

Re: nutch with cassandra internal network usage

2013-02-21 Thread Lewis John Mcgibbney
I get it fine. I do think it is important to discuss the current filtering code in the generator, though. Yeah, okay, it turns out that our current implementation (which reads all entries, then does the filtering on the Nutch side) can be horribly expensive, but at least there is some mechanism in place, right?
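The cost difference described above can be illustrated with a toy model. This is only a sketch: `Store`, `scanAll`, and `scanFiltered` are hypothetical names invented for illustration, not Nutch or Gora APIs, and the "network" is just a counter of rows handed to the caller.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FilterPlacement {
    /** Hypothetical stand-in for a remote store; counts rows sent to the caller. */
    static final class Store {
        private final List<String> rows;
        long rowsTransferred = 0;

        Store(List<String> rows) {
            this.rows = rows;
        }

        /** Sends every row; the caller is expected to filter afterwards. */
        List<String> scanAll() {
            rowsTransferred += rows.size();
            return new ArrayList<>(rows);
        }

        /** Applies the filter inside the store; only matching rows are sent. */
        List<String> scanFiltered(Predicate<String> filter) {
            List<String> matches = new ArrayList<>();
            for (String row : rows) {
                if (filter.test(row)) {
                    matches.add(row);
                }
            }
            rowsTransferred += matches.size();
            return matches;
        }
    }

    public static void main(String[] args) {
        // Made-up data: one row in every hundred is marked as due.
        List<String> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add((i % 100 == 0 ? "due:" : "later:") + "http://example.com/" + i);
        }
        Predicate<String> isDue = row -> row.startsWith("due:");

        // Filtering on the caller's side: every row crosses the wire first.
        Store clientSide = new Store(rows);
        List<String> selectedByClient = new ArrayList<>();
        for (String row : clientSide.scanAll()) {
            if (isDue.test(row)) {
                selectedByClient.add(row);
            }
        }

        // Filtering at the backend: only the matches cross the wire.
        Store backendSide = new Store(rows);
        List<String> selectedByBackend = backendSide.scanFiltered(isDue);

        System.out.println("client-side filter:  selected=" + selectedByClient.size()
                + " transferred=" + clientSide.rowsTransferred);
        System.out.println("backend-side filter: selected=" + selectedByBackend.size()
                + " transferred=" + backendSide.rowsTransferred);
    }
}
```

Both approaches select the same rows; only the number of rows moved differs, which is where the expense of filtering on the Nutch side would come from.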

Re: nutch with cassandra internal network usage

2013-02-21 Thread Roland
Hi Julien, the point I personally don't get is: why is generating fast but fetching not? If it's possible to filter the generatorJob at the backend (which I think it does), shouldn't it be possible to do the same for the fetcher? --Roland On 21.02.2013 12:27, Julien Nioche wrote: Lewis,

Re: Deploy nutch on existing Hadoop cluster

2013-02-21 Thread Amit Sela
I basically just built with ant and copied the contents of deploy (job file + nutch and crawl scripts) to a nutch folder in my hadoop-user directory on the master. I changed the crawl script to work only in distributed mode and it seems to work... though I am getting a lot of Child Error exceptions

Re: Deploy nutch on existing Hadoop cluster

2013-02-21 Thread Lewis John Mcgibbney
Welcome to the world of post 1.3 Nutch ;) On Thursday, February 21, 2013, Amit Sela am...@infolinks.com wrote: I basically just built with ant and copied the contents of deploy (job file + nutch and crawl scripts) to nutch folder in my hadoop-user directory on the master. I changed the crawl

gora zookeeper error

2013-02-21 Thread kaveh minooie
Has anyone encountered this error before: org.apache.gora.util.GoraException: org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to connect to ZooKeeper but the connection closes immediately. This could be a sign that the server has too many connections (30 is the default).

Re: gora zookeeper error

2013-02-21 Thread Lewis John Mcgibbney
replace <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default" /> with <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default"> <exclude org="org.apache.hbase" name="hbase" rev="0.90.4" /> <include org="org.apache.hbase" name="hbase" rev="0.90.6" /> </dependency> hopefully something

Re: gora zookeeper error

2013-02-21 Thread kaveh minooie
This is what I ended up with in my ivy.xml: <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default"> <!-- <exclude org="org.apache.hbase" name="hbase" rev="0.90.4" /> --> <include org="org.apache.hbase" name="hbase" rev="0.90.6" /> </dependency> which caused this: BUILD FAILED

Nutch 1.6 with Java - not loading correct configuration file

2013-02-21 Thread imehesz
Hello, I finally got past all the terminal issues and I can run Nutch and Solr with no problems from the command line. When I try to run a Nutch crawl from Java, it's a different story. The error message is pretty self-explanatory: Fetcher: No agents listed in 'http.agent.name'

Re: gora zookeeper error

2013-02-21 Thread Lewis John Mcgibbney
Try this: <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default"> <exclude org="org.apache.hbase" name="hbase" rev="0.90.4" /> </dependency> ... <dependency org="org.apache.hbase" name="hbase" rev="0.90.6" conf="*->default" /> hth On Thu, Feb 21, 2013 at 11:41 AM, kaveh minooie

Re: Nutch 1.6 with Java - not loading correct configuration file

2013-02-21 Thread Lewis John Mcgibbney
http://svn.apache.org/repos/asf/nutch/tags/release-1.6/src/java/org/apache/nutch/util/NutchConfiguration.java On Thu, Feb 21, 2013 at 12:03 PM, imehesz imeh...@gmail.com wrote: hello, I finally crossed all the terminal issues and I can run Nutch and Solr with no problems from the command

Re: Nutch 1.6 with Java - not loading correct configuration file

2013-02-21 Thread Sebastian Nagel
Hi, So where is Nutch in Java loading the configuration file from? (and how can I overwrite it) – configuration files are found via Java’s classpath – only the first instance of each file found in one of the directories of the classpath is used – settings in nutch-site.xml overwrite
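To see which copy of a configuration file wins on a given classpath, a small probe can list every copy in lookup order. This is a minimal sketch using only the JDK's `ClassLoader.getResources`; the class name `ConfigLocator` is made up for illustration, and it assumes the files of interest are `nutch-default.xml` and `nutch-site.xml`. It does not load or parse anything the way Nutch does, it only reports locations.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

class ConfigLocator {
    /** Returns every copy of the named resource on the classpath, in lookup order. */
    static List<URL> locateAll(String resourceName) {
        ClassLoader loader = Thread.currentThread().getContextClassLoader();
        if (loader == null) {
            loader = ConfigLocator.class.getClassLoader();
        }
        try {
            return Collections.list(loader.getResources(resourceName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Assumed file names; pass others as arguments to probe them instead.
        String[] names = args.length > 0
                ? args
                : new String[] {"nutch-default.xml", "nutch-site.xml"};
        for (String name : names) {
            List<URL> copies = locateAll(name);
            if (copies.isEmpty()) {
                System.out.println(name + ": not found on the classpath");
                continue;
            }
            // The first entry is the one a plain classpath lookup would return.
            System.out.println(name + ": first match " + copies.get(0));
            for (URL shadowed : copies.subList(1, copies.size())) {
                System.out.println("  shadowed copy: " + shadowed);
            }
        }
    }
}
```

Running it with the same classpath as the crawling application should show whether an unexpected copy, for example one packed inside a jar, sits ahead of the edited `nutch-site.xml`.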

Re: gora zookeeper error

2013-02-21 Thread kaveh minooie
thanks man, this worked: <dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1" conf="*->default"> <exclude org="org.apache.hbase" name="hbase" /> </dependency> <dependency org="org.apache.hbase" name="hbase" rev="0.90.5" conf="*->default" /> turned out maven repo1 doesn't

Re: gora zookeeper error

2013-02-21 Thread Lewis John Mcgibbney
I take it you read the mailing list thread I pointed you to? Unfortunately, although linked, problems of this nature are not solved directly within Nutch. On Thu, Feb 21, 2013 at 3:47 PM, kaveh minooie ka...@plutoz.com wrote: thanks man, this worked: <dependency org="org.apache.gora" name="gora-hbase"

Re: Customizing Nutch 1.5 in Eclipse Juno

2013-02-21 Thread प्रशांत मोरे
Thank you Tejas. Your tips helped a lot. One more thing: after building, the plugin.folder property should point to build/plugins for executing the crawl. Now it is crawling fine. My concern is to locate the object which has the content and its metadata so that I can capture that and direct to my