Hi Sebastian,

I think the problem is with the fetch not returning any results. I
checked your suggestion, but it did not work.

Cheers,

Leo

On Thu, 2011-07-21 at 22:16 +0200, Sebastian Nagel wrote:

> Hi Leo, hi Lewis,
> 
> > From the times both the fetching and parsing took, I suspect that maybe
> > Nutch didn't actually fetch the URL,
> 
> This may be the reason. "Empty" segments may break some of the crawler steps.
> 
> But if I'm not mistaken, the updatedb command
> is not quite correct:
> 
>  > llist@LeosLinux:~/nutchData$ /usr/share/nutch/runtime/local/bin/nutch
>  > updatedb /home/llist/nutchData/crawl/crawldb
>  > -dir /home/llist/nutchData/crawl/segments/20110721122519
>  > CrawlDb update: starting at 2011-07-21 12:28:03
>  > CrawlDb update: db: /home/llist/nutchData/crawl/crawldb
>  > CrawlDb update: segments:
>  > [file:/home/llist/nutchData/crawl/segments/20110721122519/parse_text,
>  > file:/home/llist/nutchData/crawl/segments/20110721122519/content,
>  > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_parse,
>  > file:/home/llist/nutchData/crawl/segments/20110721122519/parse_data,
>  > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_fetch,
>  > file:/home/llist/nutchData/crawl/segments/20110721122519/crawl_generate]
>  > CrawlDb update: additions allowed: true
> 
> As with other commands that read segments, there are two ways to
> pass segments as arguments: 1) enumerate all segments explicitly, or
> 2) pass the parent directory of all segments via -dir. See:
> 
> % $NUTCH_HOME/bin/nutch updatedb
> Usage: CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...)
>        [-force] [-normalize] [-filter] [-noAdditions]
>          crawldb         CrawlDb to update
>          -dir segments   parent directory containing all segments to update from
>          seg1 seg2 ...   list of segment names to update from
> 
> Try your updatedb command without -dir; it should work.
> 
> Sebastian
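
Sebastian's point can be sketched without running Nutch: with -dir, the
updater enumerates the subdirectories of the given path as segments, so
pointing -dir at a single segment makes its internal parts (content,
crawl_fetch, and so on) look like segments, exactly as the log above shows.
A minimal shell illustration, using a hypothetical scratch copy of the
directory layout from the log (the nutch invocations in the trailing
comments are illustrative, not output from a real run):

```shell
# Recreate the segment layout seen in the log (scratch copy for illustration;
# brace expansion assumes bash).
mkdir -p crawl/segments/20110721122519/{content,crawl_fetch,crawl_generate,crawl_parse,parse_data,parse_text}

# What "-dir <one segment>" effectively enumerates: the segment's six
# internal parts, none of which is a valid segment on its own.
ls -d crawl/segments/20110721122519/*/

# What "-dir <segments parent>" enumerates: the actual segments.
ls -d crawl/segments/*/

# So either of these invocations would be correct instead:
#   bin/nutch updatedb crawl/crawldb crawl/segments/20110721122519
#   bin/nutch updatedb crawl/crawldb -dir crawl/segments
```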

