Hi Bai,

You haven't mentioned which Nutch version you're using; it would help if you
could.

You also haven't injected any seed URLs into your crawldb. From memory, I
think the -topN parameter should be passed to the generate command.
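Roughly, the start of your sequence would then look like this (the urls/
seed directory and the example URL are just assumptions; adjust to your setup):

```shell
# Create a plain-text seed list, one URL per line (path is an example)
mkdir -p urls
echo "http://example.com/" > urls/seed.txt

# Inject the seed URLs into the crawldb
bin/nutch inject crawl/crawldb urls

# Generate a fetch list; -topN caps how many top-scoring URLs are selected
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
```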

Just to note, it is not necessary to set -noParsing when executing the fetch
command; this is already the default behaviour. I'm not sure why your machine
is churning, but this shouldn't be happening. Do you have any log data to
suggest why this is the case?

On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:

> So I was able to get Nutch up and working using the crawl command.  I set
> my
> depth and topN and it ran and indexed the pages for me.
>
> But now I'm trying to split out the separate pieces in order to distribute
> them and add my own parser.  I'm running the following.
>
> bin/nutch generate crawl/crawldb crawl/segments
> export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> bin/nutch fetch $SEGMENT -noParsing
> bin/nutch parse $SEGMENT
> bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
>
>
> I don't see any way to determine how deep to crawl.  Is this possible, or
> do
> I have to manually manage the db?  And if so, how do I do that?
>
> And as a side note, why does Nutch invoke hadoop during the fetch command
> even though I have noParsing set?  After fetching my links, my machine
> churns for around twenty minutes before finally ending, even though all the
> fetch threads completed already.
>
> Thanks.
>



-- 
*Lewis*
