Hi Leo, hi Lewis,
From the times both the fetching and parsing took, I suspect that Nutch
didn't actually fetch the URL. That may be the reason: "empty" segments
can break some of the crawler steps.
But unless I'm mistaken, the updatedb command itself is also
not quite correct:
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> updatedb /home/llist/nutchData/crawl/crawldb
> -dir /home/llist/nutchData/crawl/segments/20110721122519
> CrawlDb update: starting at 2011-07-21 12:28:03
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments:
> [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
> file:/home/llist/nutchData/crawl/segments/20110721122519/content,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
> CrawlDb update: additions allowed: true
As with other commands that read segments, there are two ways to
pass segments as arguments: 1) enumerate all segments, or 2) pass the
parent directory of all segments via -dir. See:
% $NUTCH_HOME/bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter]
[-noAdditions]
crawldb CrawlDb to update
-dir segments parent directory containing all segments to update from
seg1 seg2 ... list of segment names to update from
Your command pointed -dir at a single segment, so updatedb treated the
segment's subdirectories (parse_text, content, etc.) as segments — you can
see them listed in the "CrawlDb update: segments:" output above.
Try your updatedb command without -dir; it should work.
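For example, with your paths either of these invocations should work (the
second assumes /home/llist/nutchData/crawl/segments contains only segment
directories):

```shell
# 1) Enumerate the segment(s) explicitly, without -dir:
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    /home/llist/nutchData/crawl/segments/20110721122519

# 2) Or point -dir at the parent directory of all segments:
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    -dir /home/llist/nutchData/crawl/segments
```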
Sebastian