Re: Nutch - DFS environment. Is it stable?

2009-10-06 Thread tittutomen
tittutomen wrote: Hi, I've been trying to set up a Nutch-Hadoop distributed environment to crawl a 3 million URL list. My experience so far is: 1. Nutch is working fine in a single-machine environment. Here I wrote a script file which first calls the nutch crawl command to crawl 1000
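[A rough sketch of such a driver loop, using the step-by-step commands rather than the one-shot crawl command. The directory names and the 1000-URL batch size are illustrative, and the seed list is assumed to be pre-split into per-batch directories.]

    #!/bin/bash
    # Inject one batch of seed URLs, then run a generate/fetch/parse/updatedb
    # cycle against the same crawldb, repeating for every batch directory.
    NUTCH=bin/nutch
    CRAWLDB=crawl/crawldb

    for seeddir in seeds/batch-*/; do
      $NUTCH inject $CRAWLDB "$seeddir"
      $NUTCH generate $CRAWLDB crawl/segments -topN 1000
      segment=$(ls -d crawl/segments/2* | tail -1)   # newest segment directory
      $NUTCH fetch "$segment"
      $NUTCH parse "$segment"                        # skip if fetcher.parse=true
      $NUTCH updatedb $CRAWLDB "$segment"
    done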

Authenticity of URLs from DMOZ

2009-10-06 Thread Gaurang Patel
Hey, Can anyone tell me what could be the reason for the following, which happened while fetching data using bin/nutch fetch: my AVG antivirus is detecting virus threats while Nutch fetches pages from the available URLs in *crawldb*. I injected DMOZ Open Directory URLs into crawldb. The antivirus already detected

Re: Authenticity of URLs from DMOZ

2009-10-06 Thread David Jashi
Gaurang, About those AVG alerts - you are fetching web pages together with any viruses they may be infected with. Of course, antivirus software will scream about it. I wouldn't run any such software on a crawling machine. With respect, David Jashi On Tue, Oct 6, 2009 at 12:36, Gaurang

prune tool

2009-10-06 Thread Fadzi Ushewokunze
Hi there, there used to be a bin/nutch prune tool - what happened to it? Or does someone have a batch script to run the prune command? Thanks, fadzi

mapred.ReduceTask - java.io.FileNotFoundException

2009-10-06 Thread bhavin pandya
Hi, I am trying to configure Nutch and Hadoop on 2 nodes. But while trying to fetch, I am getting this exception (I sometimes get the same exception while injecting new seeds): 2009-10-06 14:56:51,609 WARN mapred.ReduceTask - java.io.FileNotFoundException: http://127.0.0.1:50060/mapOutput?

Re: mapred.ReduceTask - java.io.FileNotFoundException

2009-10-06 Thread tittutomen
bhavin pandya-3 wrote: Hi, I am trying to configure Nutch and Hadoop on 2 nodes. But while trying to fetch, I am getting this exception (I sometimes get the same exception while injecting new seeds): 2009-10-06 14:56:51,609 WARN mapred.ReduceTask - java.io.FileNotFoundException:
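[A common culprit when reducers try to pull map output from http://127.0.0.1:50060/... on a two-node cluster is the slave's hostname resolving to the loopback address. A hedged checklist; node1 is a placeholder hostname and the config file name assumes a Hadoop 0.19-era layout.]

    # On each node, the hostname should resolve to its LAN IP, not 127.0.0.1,
    # and both nodes should resolve each other by name.
    hostname                  # e.g. node1
    grep node1 /etc/hosts     # should show the real IP, not 127.0.0.1

    # Also make sure the namenode/jobtracker addresses in conf/hadoop-site.xml
    # (or the *-site.xml files on newer Hadoop) use real hostnames, not localhost.
    grep -A 1 'mapred.job.tracker' conf/hadoop-site.xml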

Re: Incremental Whole Web Crawling

2009-10-06 Thread Paul Tomblin
Don't change options in nutch-default.xml - copy the option into nutch-site.xml and change it there. That way the change will (hopefully) survive an upgrade. On Tue, Oct 6, 2009 at 1:01 AM, Gaurang Patel gaurangtpa...@gmail.com wrote: Hey, Never mind. I got *generate.update.db* in
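[For instance, the generate.update.db property mentioned above can be overridden like this, leaving nutch-default.xml untouched. A minimal sketch; in practice nutch-site.xml would also carry your other overrides, such as http.agent.name.]

    # Write a nutch-site.xml that overrides a single property; values set here
    # take precedence over nutch-default.xml. (This overwrites an existing file.)
    cat > conf/nutch-site.xml <<'EOF'
    <?xml version="1.0"?>
    <configuration>
      <property>
        <name>generate.update.db</name>
        <value>true</value>
      </property>
    </configuration>
    EOF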

RE: problem ending crawl nutch 1.0 - DeleteDuplicates

2009-10-06 Thread BELLINI ADAM
Hi, I forgot to say that when the errors happen and the crawling stops, it creates the folder 'dedup-urls-485515157'. Can someone tell me, when using 'ant', what we should do after that, concerning jars, the build, etc.? Thanks. From: mbel...@msn.com To: nutch-user@lucene.apache.org Subject:

generate/fetch using multiple machines

2009-10-06 Thread Gaurang Patel
All- Any idea how to configure Nutch to generate/fetch on multiple machines simultaneously? -Gaurang

Re: Incremental Whole Web Crawling

2009-10-06 Thread Julien Nioche
This could be done much more simply with a modified Generator that outputs multiple segments from one job, but that's not implemented yet. It would also be more efficient, as crawlDB operations such as generate or update take more time as the crawlDB grows (unlike fetch and parse, which are
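[Until such a Generator exists, the usual workaround is to run generate several times in a row before fetching, with generate.update.db set to true so each round marks its URLs in the crawldb and later rounds skip them. A rough sketch; the round count and -topN value are illustrative.]

    # With generate.update.db set to true, each round marks the URLs it selects
    # in the crawldb, so consecutive rounds produce disjoint segments that can
    # be fetched independently (possibly on different machines).
    for i in 1 2 3; do
      bin/nutch generate crawl/crawldb crawl/segments -topN 100000
    done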

Re: generate/fetch using multiple machines

2009-10-06 Thread Eric
Yes, using a Hadoop cluster. I would recommend the tutorial called NutchHadoopTutorial on the wiki. On Oct 6, 2009, at 8:56 AM, Gaurang Patel wrote: All- Any idea how to configure Nutch to generate/fetch on multiple machines simultaneously? -Gaurang
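[The gist of that tutorial, heavily abridged and with placeholder hostnames: Nutch jobs are ordinary MapReduce jobs, so once the Hadoop cluster is running, generate and fetch are spread across the slave nodes automatically.]

    # conf/slaves lists the worker machines, one hostname per line
    # (node1/node2 are placeholders).
    printf 'node1\nnode2\n' > conf/slaves

    # Start HDFS and MapReduce across the cluster (Hadoop 0.19-era scripts).
    bin/start-all.sh

    # From then on, nutch commands run as distributed MapReduce jobs, e.g.:
    bin/nutch generate crawl/crawldb crawl/segments -topN 100000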

Hadoop Script

2009-10-06 Thread Eric
Has anyone written a script for whole-web crawling using Hadoop? The script for Nutch doesn't work since the data is inside HDFS (tail -f won't work with this). Thanks, Eric

Re: Hadoop Script

2009-10-06 Thread Eric Osgood
Sorry Ryan, I should have clarified that I am using Nutch as my crawler. There is a script for Nutch to do whole-web crawling, but it is not compatible with Hadoop. Eric Osgood - Cal Poly - Computer Engineering Moon Valley Software
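[The local-filesystem assumptions in that script are mostly the ls/tail calls; the equivalent checks go through the hadoop shell. A hedged sketch of the substitutions, with illustrative paths.]

    # Find the newest segment in HDFS instead of using a local `ls`.
    segment=$(bin/hadoop dfs -ls crawl/segments | grep crawl/segments/ \
              | awk '{print $NF}' | sort | tail -1)
    echo "latest segment: $segment"

    # The crawl data itself is binary, so inspect it with the Nutch readers
    # rather than cat/tail:
    bin/nutch readdb crawl/crawldb -stats

    # Job progress is easier to follow from the JobTracker web UI (port 50030)
    # than by tailing files stored in HDFS.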

Targeting Specific Links

2009-10-06 Thread Eric Osgood
Is there a way to inspect the list of links that Nutch finds per page and then, at that point, choose which links I want to include/exclude? That would be the ideal remedy to my problem. Eric Osgood - Cal Poly - Computer Engineering Moon Valley Software

RE: Number of urls in the crawl database.

2009-10-06 Thread BELLINI ADAM
Hi, yes, you can use this command: ./bin/nutch readdb crawl_sqla/crawldb/ -dump whole_db - it will give you all the URLs. Number of urls in the crawl database. Gaurang Patel Mon, 05 Oct 2009
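[If only the count is needed rather than a full dump, the same tool's -stats option prints the totals directly; a small sketch using the same crawldb path.]

    # Print aggregate crawldb statistics (total URL count, status breakdown)
    # without writing a dump to disk.
    ./bin/nutch readdb crawl_sqla/crawldb/ -stats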

Re: Targeting Specific Links

2009-10-06 Thread Andrzej Bialecki
Eric Osgood wrote: Is there a way to inspect the list of links that Nutch finds per page and then, at that point, choose which links I want to include/exclude? That is the ideal remedy to my problem. Yes, look at ParseOutputFormat; you can make this decision there. There are two standard
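[ParseOutputFormat already runs each outlink through the configured URL filters, so when the include/exclude decision can be expressed as URL patterns it can be done purely in configuration, without touching the code. A sketch with placeholder patterns; the example.com domain is illustrative, and this does not cover the per-page link count described below.]

    # Illustrative regex-urlfilter.txt: rules are evaluated top-down, '+' keeps
    # a matching URL, '-' drops it, and the first match wins.
    cat > conf/regex-urlfilter.txt <<'EOF'
    # keep outlinks on the sites of interest
    +^http://([a-z0-9]*\.)*example\.com/
    # drop every other outlink
    -.
    EOF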

Re: Targeting Specific Links

2009-10-06 Thread Eric Osgood
Andrzej, How would I check for a flag during fetch? Maybe this explanation can shed some light: ideally, I would like to check the list of links for each page, but I still need a total of X links per page. If I find the links I want, I add them to the list up to X; if I don't reach X, I