Hi Lewis,
A few comments below.
I use Nutch 2.x as it enables me to do analytics over the data I am
crawling. This is my justification for trying to maintain and further the
development of that branch over the last while.
Just out of interest, what sort of analytics do you do and why is it
Hi Meraj,
The generator will place all the URLs in a single segment if they all
belong to the same host, for politeness reasons. Otherwise it will use
whichever value is passed with the -numFetchers parameter in the generation
step.
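For reference, a generate invocation along those lines might look like this (a sketch for Nutch 1.x; the crawldb/segments paths and the -topN value are placeholders):

```shell
# Generate a new fetch list, partitioned into 4 segments so that 4 fetcher
# map tasks can run in parallel. URLs from the same host are kept together
# in one partition for politeness.
bin/nutch generate crawl/crawldb crawl/segments -topN 1000 -numFetchers 4
```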
Why don't you use the crawl script in /bin instead of tinkering
Hi, Nutch Gurus,
I have a use case that I need to implement and I hope that someone can help.
I have a situation where I need to generate and build URLs dynamically and pass
them to the respective filter.
I want to pass a newly constructed string to the Filter implementation
associated with
Hi All,
I am doing a POC to help decide if we should be using the Nutch 1.9 or 2.2.1 version.
We would be indexing our crawled data in an ElasticSearch 1.x version.
I know the 2.2.1 version provides out-of-the-box support for Elastic 0.x, but to
use 2.x I need to change the code (ElasticWriter.java). This
Hi Iqbal,
Hi Julien,
I have 15 domains and they are all being fetched in a single map task which
does not fetch all the URLs no matter what depth or topN I give.
I am submitting the Nutch job jar, which seems to be using the Crawl.java
class. How do I use the crawl script on a Hadoop cluster? Are there any
See http://wiki.apache.org/nutch/NutchTutorial#A3.3._Using_the_crawl_script
just go to runtime/deploy/bin and run the script from there.
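For anyone following along, the deploy-mode invocation from the tutorial looks roughly like this (a sketch; the seed directory, crawl ID, Solr URL, and number of rounds are placeholders, and the exact arguments vary between Nutch versions):

```shell
cd runtime/deploy
# The crawl script drives the inject/generate/fetch/parse/updatedb rounds
# and picks up Hadoop automatically when it is available on the PATH.
bin/crawl urls/ mycrawl http://localhost:8983/solr/ 2
```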
Julien
On 29 August 2014 13:38, Meraj A. Khan mera...@gmail.com wrote:
Hi Julien,
I have 15 domains and they are all being fetched in a single map task
Thanks Julien for the prompt response.
Actually, since the model for the 1.9 version is all plugin-based, I shouldn't be
expecting an ivy.xml like in 2.x to have an elastic config. So ignore that
comment.
Yes, I mean HDFS (new to big data and Hadoop). Isn't HBase the default one for
1.9 too?
Thanks, can this be used on a Hadoop cluster?
Sent from my HTC
- Reply message -
From: Julien Nioche lists.digitalpeb...@gmail.com
To: user@nutch.apache.org user@nutch.apache.org
Subject: Nutch 1.7 fetch happening in a single map task.
Date: Fri, Aug 29, 2014 9:00 AM
See
As the name runtime/deploy suggests, it is used exactly for that purpose
;-) Just make sure HADOOP_HOME/bin is added to the path and run the script,
that's all.
Look at the bottom of the nutch script for details.
Julien
PS: there will be a Nutch tutorial at the forthcoming ApacheCon EU (
Dear Iqbal,
Hi,
As far as I know, if you don't need the Gora mapper for using Nutch over HBase,
MySQL, etc., it is better to use version 1.x, since some of the Nutch
functionality is not implemented in version 2.x and Nutch 1.x provides
better performance for crawling web pages. ES is not difficult
Sorry Julien, I overlooked the directory names.
My understanding is that the Hadoop job is submitted to a cluster by using
the following command on the RM node: bin/hadoop <.job file> <params>
Are you suggesting I submit the script instead of the Nutch .job jar like
below?
bin/hadoop bin/crawl
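For what it's worth, my reading of the earlier replies is that the script is run directly rather than passed to bin/hadoop: the crawl script invokes bin/nutch for each step, and the bottom of bin/nutch dispatches to `hadoop jar` on the .job file when a Hadoop install is found on the PATH. A sketch (the crawl arguments are placeholders):

```shell
export PATH="$PATH:$HADOOP_HOME/bin"   # make the hadoop command resolvable
cd runtime/deploy
# Run the script itself; it will submit the .job file to the cluster for you.
bin/crawl urls/ mycrawl http://localhost:8983/solr/ 2
```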
+1, great.
I'd like to have a conversation about versioning.
Since we're at 1.9, my suggestion would be to have the
next in the trunk series (1.x) move to version 3.x post
1.9 for the release.
Nutch2 remains Nutch and can be worked on there. That
would give us a nice split in the diversionary
I think it is a great idea to build images with an environment correctly
set up. I think two types of images would be helpful.
1. Development (Virtualbox)
Here, we have Eclipse, plugins, pseudo-distributed Hadoop, etc. correctly installed,
maybe on an Ubuntu box with 3D acceleration enabled. Then people can
Hi Nicholas,
NOTE: Thread name has changed to reflect diversion on topic.
On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote:
Will you use config management like Ansible backing Vagrant?
Well, thanks for the links here. The GitHub repos they have indicate that
they
Hi Julien,
On Fri, Aug 29, 2014 at 6:01 AM, user-digest-h...@nutch.apache.org wrote:
Just out of interest, what sort of analytics do you do and why is it better
to do it in 2.x than 1.x?
Nowhere did I say it was better or worse than in 1.x. Let me be clear here.
I use Nutch 2.x, as I
Hi Lewis
I have a main development server and connect to it from other machines over
Lan and sometimes via Wan
Vagrant by itself has a highly constrained network config. You need the
proprietary Vagrant Cloud share, which does SSH tunnels, or you use config
management for networking.
A Vagrantfile
I'm confused as to what the significant differences between 1.x and
2.x are.
Is there a bit of history that I could read about why the two were
developed in parallel?
As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which
path would be best for me to
Hi Guys
I'm still struggling with this. In summary my directory structure is as
follows
/
|_doccontrol
|_DC-10 Incoming Correspondence
|_DC-11 Outgoing Correspondence
If, when I first run Nutch, the folders DC-10 and DC-11 contain all the files
to be indexed, then Nutch crawls everything