Did you check the logs (NUTCH_HOME/logs/hadoop.log) for any exception or
error messages?
Also, you might have a look at these settings in nutch-site.xml (default
values are in nutch-default.xml):
http.content.limit and parser.html.impl
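
For example, overrides along these lines could go into conf/nutch-site.xml.
This is only a sketch: the values are suggestions to experiment with, not a
confirmed fix for your crawl. As far as I know, the default for
http.content.limit is 65536 bytes (longer pages are truncated, which can
make large pages fail to parse), and the default parser.html.impl is neko.

```xml
<?xml version="1.0"?>
<!-- Sketch of overrides for conf/nutch-site.xml; merge these properties
     into your existing <configuration> element rather than replacing it. -->
<configuration>
  <property>
    <name>http.content.limit</name>
    <!-- A negative value disables truncation of fetched content
         (assumed here as something to try; pick a limit that suits you). -->
    <value>-1</value>
  </property>
  <property>
    <name>parser.html.impl</name>
    <!-- Switch from the default NekoHTML parser to TagSoup to see
         whether parsing behaves differently. -->
    <value>tagsoup</value>
  </property>
</configuration>
```

If you change these, rerun the fetch and parse steps so the pages are
downloaded and parsed again with the new settings.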


On Tue, Jun 25, 2013 at 7:04 AM, Sznajder ForMailingList <
[email protected]> wrote:

> Hello
>
> I installed Nutch 2.2 on my linux machine.
>
> I defined the seed directory with one file containing:
> http://en.wikipedia.org/
> http://edition.cnn.com/
>
>
> I ran the following:
> sh bin/nutch inject ~/DataExplorerCrawl_gpfs/seed/
>
> After this step:
> the call
> -bash-4.1$ sh bin/nutch readdb -stats
>
> returns
> TOTAL urls:     2
> status 0 (null):        2
> avg score:      1.0
>
>
> Then, I ran the following:
> bin/nutch generate -topN 10
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
> bin/nutch generate -topN 1000
> bin/nutch fetch -all
> bin/nutch parse -all
> bin/nutch updatedb
>
>
> However, the stats call after these steps is still:
> the call
> -bash-4.1$ sh bin/nutch readdb -stats
> status 5 (status_redir_perm):   1
> max score:      2.0
> TOTAL urls:     3
> avg score:      1.3333334
>
>
>
> Only 3 urls?!
> What do I miss?
>
> thanks
>
> Benjamin
>
