tittutomen wrote:
Hi,
I've been trying to set up a Nutch-hadoop distributed environment to crawl
a 3 Million URL list.
My experience so far has been:
1. Nutch is working fine in a single-machine environment. Here I wrote a
script file which calls the nutch crawl command first to crawl 1000
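For context, such a script typically wraps a one-shot crawl command like the
following (a sketch; the directory names, depth, and topN here are
illustrative, not from this message):

  # crawl the seeds in urls/, 3 rounds deep, at most 1000 URLs per round
  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000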
Hey,
Can anyone tell me what could be the reason for the following, which happened
while fetching data using bin/nutch fetch:
My AVG antivirus is detecting virus threats while Nutch fetches pages from
the URLs available in the *crawldb*. I injected DMOZ Open Directory URLs into
the crawldb. The antivirus has already detected
Gaurang,
About those AVG alerts - you are fetching web pages together with any
viruses they may be infected with.
Of course antivirus software will scream about it.
I wouldn't run any such software on a crawling machine.
Respectfully,
David Jashi
On Tue, Oct 6, 2009 at 12:36, Gaurang
Hi there,
Wasn't there a bin/nutch prune tool? If so, what happened to it?
Or does someone have a batch script to run the prune command?
Thanks,
fadzi
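If your release still ships it, the old pruning code can be run by class
name, since bin/nutch treats an unrecognized command as a Java class to
execute (a sketch; whether org.apache.nutch.tools.PruneIndexTool is present,
and which flags it takes, depends on your version, so run it with no
arguments to see its usage):

  bin/nutch org.apache.nutch.tools.PruneIndexTool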
Hi,
I am trying to configure Nutch and Hadoop on 2 nodes, but while trying
to fetch, I am getting this exception. (I sometimes get the same
exception while injecting new seeds.)
2009-10-06 14:56:51,609 WARN mapred.ReduceTask -
java.io.FileNotFoundException: http://127.0.0.1:50060/mapOutput?
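One common cause of reducers trying to fetch map output from
http://127.0.0.1:50060/... on a multi-node cluster is hostname resolution:
if each node maps its own hostname to the loopback address, every node looks
for the other node's map output on itself. A sketch of the usual /etc/hosts
layout to check (addresses and hostnames are assumptions, not from this
thread):

  # /etc/hosts on each node: never map the node's own hostname to 127.0.0.1
  127.0.0.1     localhost
  192.168.1.10  node1
  192.168.1.11  node2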
Don't change options in nutch-default.xml - copy the option into
nutch-site.xml and change it there. That way the change will
(hopefully) survive an upgrade.
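For example, overriding the generate.update.db property that comes up just
below would look like this in conf/nutch-site.xml (a minimal sketch; whether
you want true depends on your setup):

  <?xml version="1.0"?>
  <configuration>
    <!-- copied from nutch-default.xml; values here override the defaults -->
    <property>
      <name>generate.update.db</name>
      <value>true</value>
    </property>
  </configuration>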
On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel gaurangtpa...@gmail.com wrote:
Hey,
Never mind. I got *generate.update.db* in
Hi,
I forgot to say that when the errors happen and the crawling stops, it
creates the folder 'dedup-urls-485515157'.
Can someone tell me, when using 'ant', what to do after that, concerning
jars, the build, etc.?
Thanks
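For what it's worth, the usual sequence from a Nutch source checkout looks
like this (a sketch; exact targets vary between releases, so check
build.xml):

  cd nutch     # source checkout root
  ant          # compiles everything into build/
  ant jar      # builds the nutch jar
  ant job      # builds the .job file used on a Hadoop cluster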
From: mbel...@msn.com
To: nutch-user@lucene.apache.org
Subject:
All-
Any idea how to configure Nutch to generate/fetch on multiple machines
simultaneously?
-Gaurang
This could be done much more simply with a modified Generator that outputs
multiple segments from one job, but that's not implemented yet.
It would also be more efficient, since crawlDB operations such as generate and
update take more time as the crawlDB grows (unlike fetch and parse, which are
Yes - using a Hadoop cluster. I would recommend the tutorial called
NutchHadoopTutorial on the wiki.
On Oct 6, 2009, at 8:56 AM, Gaurang Patel wrote:
All-
Any idea how to configure Nutch to generate/fetch on multiple machines
simultaneously?
-Gaurang
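On a cluster, generate can also partition a single fetch list across several
fetcher tasks with the Generator's -numFetchers option (a sketch; the paths,
-topN value, and segment name are illustrative):

  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 2
  # fetch the newly created segment (timestamp name is illustrative)
  bin/nutch fetch crawl/segments/20091006123456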
Has anyone written a script for whole-web crawling using Hadoop? The
script for Nutch doesn't work, since the data is inside HDFS (tail -f
won't work with this).
Thanks,
Eric
Sorry Ryan,
I should have clarified that I am using Nutch as my crawler. There is
a script for Nutch to do whole-web crawling, but it is not compatible
with Hadoop.
Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
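A rough sketch of the loop such a script needs once the data lives in HDFS,
with the newest segment name read back out of HDFS rather than the local
filesystem (the paths and the ls parsing are assumptions; adjust for your
Hadoop version):

  # one whole-web iteration against crawl data stored in HDFS
  bin/nutch generate crawl/crawldb crawl/segments -topN 10000
  seg=`bin/hadoop dfs -ls crawl/segments | tail -1 | awk '{print $NF}'`
  bin/nutch fetch $seg
  bin/nutch updatedb crawl/crawldb $seg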
Is there a way to inspect the list of links that Nutch finds per page
and then, at that point, choose which links I want to include/exclude?
That is the ideal remedy to my problem.
Eric Osgood
-
Cal Poly - Computer Engineering
Moon Valley Software
Hi,
Yes, you can use this command:
./bin/nutch readdb crawl_sqla/crawldb/ -dump whole_db
It will give you all the URLs.
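If it's just the number of URLs you're after (per the thread subject), readdb
also has a stats switch (same crawldb path assumed):

  ./bin/nutch readdb crawl_sqla/crawldb/ -stats
  # prints TOTAL urls plus a breakdown by status (db_fetched, db_unfetched, ...)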
Number of urls in the crawl database.
Gaurang Patel
Mon, 05 Oct 2009
Eric Osgood wrote:
Is there a way to inspect the list of links that Nutch finds per page
and then, at that point, choose which links I want to include/exclude?
That is the ideal remedy to my problem.
Yes - look at ParseOutputFormat; you can make this decision there. There
are two standard
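For illustration, one standard hook for this kind of include/exclude decision
is the URLFilter extension point. A minimal sketch against the Nutch plugin
API (the class name and the rule itself are made up, and this is not
necessarily what Andrzej meant):

  // Returning the URL keeps the link; returning null drops it.
  package org.example.nutch;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.net.URLFilter;

  public class ExampleLinkFilter implements URLFilter {
    private Configuration conf;

    public String filter(String url) {
      // illustrative rule: only follow links within one domain
      return url.contains("example.edu") ? url : null;
    }

    public void setConf(Configuration conf) { this.conf = conf; }
    public Configuration getConf() { return conf; }
  }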
Andrzej,
How would I check for a flag during fetch?
Maybe this explanation can shed some light:
Ideally, I would like to check the list of links for each page while
still needing a total of X links per page: if I find the links I want,
I add them to the list up until X, and if I don't reach X, I
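One related knob, read in ParseOutputFormat as mentioned above:
db.max.outlinks.per.page caps how many outlinks are kept per page, though it
only truncates the list rather than choosing which links to keep (a
nutch-site.xml sketch; the value 50 is illustrative):

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>50</value>
  </property>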