Does anyone have a good tutorial on deploying Nutch (1.6) on a pre-existing
Hadoop cluster?
Thanks.
Perhaps this could help:
http://www.rui-yang.com/develop/build-nutch-1-4-cluster-with-hadoop/
- Original message -
From: Amit Sela am...@infolinks.com
To: user@nutch.apache.org
Sent: Thursday, February 21, 2013 5:00:29
Subject: Deploy nutch on existing Hadoop cluster
Anyone have a
https://wiki.apache.org/nutch/NutchHadoopTutorial
Basically, follow the steps in
http://hadoop.apache.org/docs/stable/cluster_setup.html and then install Nutch
on the master node of your cluster, 'cd runtime/deploy/bin' and use the
nutch scripts as usual. You can then use the standard MapReduce webapp
I get it, fine. I do think it is important to discuss the current filtering
code in the generator though. Yes, it turns out that our current
implementation (which reads all entries and then filters on the Nutch side)
can be horribly expensive, but at least there is some mechanism in place,
right?
Hi Julien,
The point I personally don't get is: why is generating fast but fetching not?
If it's possible to filter the GeneratorJob at the backend (which I think
it does), shouldn't it be possible to do the same for the fetcher?
--Roland
On 21.02.2013 12:27, Julien Nioche wrote:
Lewis,
I basically just built with Ant and copied the contents of deploy (job file
+ nutch and crawl scripts) to the nutch folder in my hadoop-user directory on
the master.
I changed the crawl script to work only in distributed mode and it seems to
work... though I am getting a lot of Child Error exceptions.
Welcome to the world of post-1.3 Nutch ;)
On Thursday, February 21, 2013, Amit Sela am...@infolinks.com wrote:
I basically just built with ant and copied the contents of deploy (job file
+ nutch and crawl scripts) to nutch folder in my hadoop-user directory on
the master.
I changed the crawl
Has anyone encountered this error before:
org.apache.gora.util.GoraException:
org.apache.hadoop.hbase.ZooKeeperConnectionException: HBase is able to
connect to ZooKeeper but the connection closes immediately. This could
be a sign that the server has too many connections (30 is the default).
replace
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1"
conf="*->default" />
with
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1"
conf="*->default">
<exclude org="org.apache.hbase" name="hbase" rev="0.90.4" />
<include org="org.apache.hbase" name="hbase" rev="0.90.6" />
</dependency>
hopefully something
This is what I ended up with in my ivy.xml:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1"
conf="*->default">
<!-- <exclude org="org.apache.hbase" name="hbase" rev="0.90.4" /> -->
<include org="org.apache.hbase" name="hbase" rev="0.90.6" />
</dependency>
which caused this:
BUILD FAILED
Hello,
I finally got past all the terminal issues and I can run Nutch and Solr with
no problems from the command line.
When I try to run the Nutch crawl from Java, it's a different story.
The error message is pretty self-explanatory:
Fetcher: No agents listed in 'http.agent.name'
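A quick way to rule out a missing or empty value is to inspect the configuration file before starting the crawl. The following is only a minimal sketch using the JDK's XML parser, not Nutch's own loading code: the class name AgentNameCheck and the value my-test-crawler are made up for illustration, and it assumes the usual Hadoop-style layout of configuration/property/name/value elements.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Nutch: checks a Hadoop-style
// configuration document for a non-empty http.agent.name property.
public class AgentNameCheck {

    // Returns the trimmed value of the first property whose <name> equals key,
    // or null if no such property exists.
    public static String property(String xml, String key) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element p = (Element) props.item(i);
                NodeList names = p.getElementsByTagName("name");
                NodeList values = p.getElementsByTagName("value");
                if (names.getLength() > 0 && values.getLength() > 0
                        && key.equals(names.item(0).getTextContent().trim())) {
                    return values.item(0).getTextContent().trim();
                }
            }
            return null;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse configuration XML", e);
        }
    }

    // True only if http.agent.name is present and not blank.
    public static boolean hasAgentName(String xml) {
        String value = property(xml, "http.agent.name");
        return value != null && !value.isEmpty();
    }

    public static void main(String[] args) {
        // Illustrative content; "my-test-crawler" is an assumed agent name.
        String site = "<configuration><property><name>http.agent.name</name>"
                + "<value>my-test-crawler</value></property></configuration>";
        System.out.println("agent name set: " + hasAgentName(site));
    }
}
```

If the check passes for the file you edited but the fetcher still complains, the Java program is probably reading a different copy of the configuration than the command-line scripts do.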
Try this
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1"
conf="*->default">
<exclude org="org.apache.hbase" name="hbase" rev="0.90.4" />
</dependency>
...
<dependency org="org.apache.hbase" name="hbase" rev="0.90.6"
conf="*->default" />
hth
On Thu, Feb 21, 2013 at 11:41 AM, kaveh minooie
http://svn.apache.org/repos/asf/nutch/tags/release-1.6/src/java/org/apache/nutch/util/NutchConfiguration.java
On Thu, Feb 21, 2013 at 12:03 PM, imehesz imeh...@gmail.com wrote:
hello,
I finally crossed all the terminal issues and I can run Nutch and Solr with
no problems from the command
Hi,
So where is Nutch in Java loading the configuration file from? (and how can
I overwrite it)
– configuration files are found via Java’s classpath
– only the first instance of each file found in one
of the directories of the classpath is used
– settings in nutch-site.xml overwrite
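To see which copy of a configuration file the JVM would pick up first, a small diagnostic can list every match on the classpath. This is a minimal sketch using only the standard ClassLoader API; the class name ConfigLocator is hypothetical and not part of Nutch, and it assumes the lookup goes through the thread context class loader.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

// Hypothetical helper, not part of Nutch: reports where a resource such as
// nutch-site.xml is found on the classpath.
public class ConfigLocator {

    // Assumption: resources are resolved through the context class loader,
    // falling back to the loader of this class.
    private static ClassLoader loader() {
        ClassLoader cl = Thread.currentThread().getContextClassLoader();
        return cl != null ? cl : ConfigLocator.class.getClassLoader();
    }

    // First match on the classpath, or null if the resource is absent.
    public static URL first(String name) {
        return loader().getResource(name);
    }

    // Every match, in the order the class loader reports them.
    public static List<URL> all(String name) {
        try {
            return Collections.list(loader().getResources(name));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : new String[] {"nutch-default.xml", "nutch-site.xml"}) {
            System.out.println(name + " -> " + first(name));
            List<URL> copies = all(name);
            // Any further entries are copies that would normally be ignored.
            for (int i = 1; i < copies.size(); i++) {
                System.out.println("  additional copy: " + copies.get(i));
            }
        }
    }
}
```

Run it with the same classpath as the crawling program; if nutch-site.xml resolves to null or to an unexpected directory, that explains why settings edited elsewhere have no effect.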
Thanks man, this worked:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.2.1"
conf="*->default">
<exclude org="org.apache.hbase" name="hbase" />
</dependency>
<dependency org="org.apache.hbase" name="hbase" rev="0.90.5"
conf="*->default" />
turned out maven repo1 doesn't
I take it you read the mailing list thread I pointed you to?
Unfortunately, although linked, problems of this nature are not solved
directly within Nutch.
On Thu, Feb 21, 2013 at 3:47 PM, kaveh minooie ka...@plutoz.com wrote:
thanks man, this worked:
<dependency org="org.apache.gora" name="gora-hbase"
Thank you Tejas.
Your tips helped a lot.
One more thing is, after building, the plugin.folder property should point
to build/plugins for executing the crawl.
Now it is crawling fine. My concern is to locate the object which has the content
and its metadata so that I can capture that and direct to my