Hi Leo, hi Lewis,
From the times both the fetching and parsing took, I suspect that Nutch
didn't actually fetch the URL. That may be the reason: "empty" segments
can break some of the crawler steps.
But unless I'm mistaken, the updatedb command itself is also
not quite correct:
> llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
> updatedb /home/llist/nutchData/crawl/crawldb
> -dir /home/llist/nutchData/crawl/segments/20110721122519
> CrawlDb update: starting at 2011-07-21 12:28:03
> CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
> CrawlDb update: segments:
> [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
> file:/home/llist/nutchData/crawl/segments/20110721122519/content,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
> file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
> file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
> CrawlDb update: additions allowed: true
As with other commands that read segments, there are two ways to
pass segments as arguments: 1) enumerate all segments, or 2) pass the
parent directory of all segments via -dir. See:
% $NUTCH_HOME/bin/nutch updatedb
Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] [-filter]
[-noAdditions]
crawldb CrawlDb to update
-dir segments parent directory containing all segments to update from
seg1 seg2 ... list of segment names to update from
Your command pointed -dir at a single segment, so updatedb treated the
segment's subdirectories (parse_text, content, etc.) as segments — you can
see them listed in the "CrawlDb update: segments:" output above.
Try your updatedb command without -dir; it should work.
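For example, with your paths either of these invocations should work (the
second assumes /home/llist/nutchData/crawl/segments contains only segment
directories):

```shell
# 1) Enumerate the segment(s) explicitly, without -dir:
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    /home/llist/nutchData/crawl/segments/20110721122519

# 2) Or point -dir at the parent directory of all segments:
/usr/share/nutch/runtime/local/bin/nutch updatedb \
    /home/llist/nutchData/crawl/crawldb \
    -dir /home/llist/nutchData/crawl/segments
```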
Sebastian