Re: Why readdb and readseg shows different figures?

2009-12-15 Thread bhavin pandya
Hi, Thanks for your prompt reply. But as per readdb it has 3634 fetched pages: status 1 (db_unfetched): 80475, status 2 (db_fetched): 3634. Whereas as per readseg, if I add the fetched/parsed pages for all segments, it comes to much more (1 + 81 + 3691 + 84178 + 84178). NAME
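To make the size of the gap concrete, here is a minimal sketch of the arithmetic in the message above. It uses only the figures quoted there. The function name compare_counts is hypothetical and is not part of Nutch, and the sketch does not attempt to explain why the two tools disagree.

```python
def compare_counts(crawldb_fetched, segment_counts):
    """Return (sum of per-segment counts, surplus over the crawldb count).

    Hypothetical helper for illustration only; it just adds numbers.
    """
    total = sum(segment_counts)
    return total, total - crawldb_fetched


# Figures quoted in the message: readdb reports 3634 db_fetched pages,
# and the per-segment readseg counts are the five numbers being added.
segment_counts = [1, 81, 3691, 84178, 84178]
total, surplus = compare_counts(3634, segment_counts)
print(total, surplus)  # 172129 168495
```

So the per-segment counts add up to 172129, which is 168495 more than the 3634 fetched pages readdb reports.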

Re: converting nutch crawl output to human readable content

2009-12-15 Thread Mischa Tuffield
Hi, I would use the following command to dump the crawl database in a human-readable format: nutch readdb crawl/crawldb -dump fooDir -format csv. I hope this helps, Mischa. On 14 Dec 2009, at 22:30, Ted Yu wrote: Hi, I used the crawl command of bin/nutch and obtained the following: ls

Is there a way to set a plugin execution order in Nutch?

2009-12-15 Thread Rupesh Mankar
Hi, Suppose I have 3 plugins A, B and C. I want to execute plugin A first, then plugin B, and finally plugin C. I specified the plugin entries in nutch-site.xml under the 'include-plugins' tag as follows: <name>plugin.includes</name>

Re: Distributed Search problem

2009-12-15 Thread MilleBii
OK thx, can I also remove the segments in the HDFS, since I don't think they are used for further crawls or even during the merge of indexed segments? That way I could save a lot of space by keeping only one copy of the segments data. 2009/12/14 Dennis Kubes ku...@apache.org Index and segments is the

Re: Distributed Search problem

2009-12-15 Thread Dennis Kubes
I wouldn't. If you want to reparse or analyze that content later, you are going to need the segments. True, it saves space, but the content is the most important part for further analysis. If you know you are not going to do any further analysis on it, then yes, it can be deleted. Dennis