Hi,
Thanks for your prompt reply.
But as per readdb, it has 3634 fetched pages:
status 1 (db_unfetched): 80475
status 2 (db_fetched): 3634
While as per readseg, if I add up the fetched/parsed pages for all segments, it
comes to much more. (1 + 81 + 3691 + 84178 + 84178)
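(Per-segment counts can be inspected directly with readseg and compared against the aggregate crawldb statistics; a sketch assuming the default crawl layout, run from the Nutch install directory. Note each fetch round writes its own segment, so a URL can be counted in more than one segment while appearing only once in the crawldb.)

```shell
# List per-segment statistics (generated/fetched/parsed counts per segment):
bin/nutch readseg -list -dir crawl/segments

# Compare against the aggregate crawldb statistics:
bin/nutch readdb crawl/crawldb -stats
```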
Hi,
I would use the following command to dump out the crawl database in a human
readable format:
nutch readdb crawl/crawldb -dump fooDir -format csv
I hope this helps,
Mischa
On 14 Dec 2009, at 22:30, Ted Yu wrote:
Hi,
I used the crawl command of bin/nutch and obtained the following:
ls
Hi,
Suppose I have 3 plugins: A, B and C. I want to execute plugin A first, then
plugin B, and last plugin C. I specified the plugin entries in nutch-site.xml
under the 'plugin.includes' property as follows:
<name>plugin.includes</name>
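(Presumably the entry looked something like the sketch below; the plugin ids A/B/C are placeholders. Note that plugin.includes is a regular expression selecting which plugins to activate, not an execution order; for some extension points Nutch reads a separate order property, e.g. urlfilter.order, to control sequencing.)

```xml
<!-- Sketch of a nutch-site.xml entry; plugin ids are placeholders. -->
<property>
  <name>plugin.includes</name>
  <value>pluginA|pluginB|pluginC</value>
  <description>Regex of plugin ids to include; does not define order.</description>
</property>

<!-- Ordering for URL filters can be set separately (hedged example;
     class names here are hypothetical): -->
<property>
  <name>urlfilter.order</name>
  <value>org.example.PluginAFilter org.example.PluginBFilter</value>
</property>
```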
OK thx, I can also remove the segments in HDFS, since I don't think they
are used for further crawls, or even during a merge of indexed segments?
That way I could save a lot of space by keeping only one copy of the
segments data.
2009/12/14 Dennis Kubes ku...@apache.org
I wouldn't. If you want to reparse or analyze that content later, you
are going to need the segments. True, it saves space, but the content is
the most important part for further analysis. If you know you are not
going to do any further analysis on it, then yes, it can be deleted.
Dennis
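(If the concern is mainly disk space, one middle ground is merging the many small segments into a single one before deciding what to delete; a sketch using Nutch's mergesegs command, with paths assumed from the default crawl layout:)

```shell
# Merge all existing segments into one segment under crawl/MERGEDsegments:
bin/nutch mergesegs crawl/MERGEDsegments -dir crawl/segments
```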