Re: Nutch crawl vs other commands

Markus Jelsma Thu, 22 Sep 2011 11:12:29 -0700

> I'm using 1.3.  This is a new setup, so I'm running the latest versions.
> 
> I did inject the urls already.  It's just that the part I was having issues
> with was the fetch, etc.  I'm using the steps at Lucid Imagination » Using
> Nutch with Solr <http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/>
> except that I already had Nutch set up and configured.
> 
> When did noParsing change?  I noticed that the Nutch wiki is out of date,
> so I'm not sure what the current setups are.


Somewhere in the first 1.x version. Later it became a parse option that 
actually never worked anyway until it was fixed in the current 1.4-dev. Still, 
it's not recommended to parse during the fetch stage.

> 
> The log data made some mention of hadoop, but I don't remember what it was.
> I'll see if it happens again and post the message.

As I mentioned in the other reply, it writes out data:
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java?view=markup

This will take a while indeed and it won't log anything during its execution.
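
On the depth question from the original message: the individual commands take no depth option. In the crawl command, depth is simply the number of generate/fetch/parse/updatedb rounds, so a loop over the same steps gives the same effect, with -topN passed to generate. Below is a minimal sketch, not a tested recipe. The function names and the NUTCH and CRAWL_DIR variables are made up for illustration, and it assumes generate exits non-zero when nothing is due for fetching.

```shell
#!/bin/sh
# Sketch only: emulate the crawl command's depth with repeated rounds.
# NUTCH and CRAWL_DIR are illustrative variables, not Nutch settings.

# Print the newest segment. Segment directories are named by timestamp,
# so the newest one sorts last.
newest_segment() {
    ls -d "$1"/segments/* | sort | tail -1
}

# Usage: crawl_rounds DEPTH TOPN
crawl_rounds() {
    depth=$1
    topn=$2
    nutch=${NUTCH:-bin/nutch}
    dir=${CRAWL_DIR:-crawl}
    round=1
    while [ "$round" -le "$depth" ]; do
        echo "round $round of $depth"
        # Assumption: generate exits non-zero when no URLs are due.
        $nutch generate "$dir/crawldb" "$dir/segments" -topN "$topn" || break
        segment=$(newest_segment "$dir")
        $nutch fetch "$segment" || return 1
        $nutch parse "$segment" || return 1
        $nutch updatedb "$dir/crawldb" "$segment" -filter -normalize || return 1
        round=$((round + 1))
    done
}
```

After sourcing this from the Nutch directory, `crawl_rounds 3 1000` would run three rounds of at most 1000 URLs each. Running `bin/nutch readdb crawl/crawldb -stats` between rounds shows what the crawldb currently holds.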
> 
> On Thu, Sep 22, 2011 at 10:44 AM, lewis john mcgibbney <[email protected]> wrote:
> > Hi Bai,
> > 
> > You haven't mentioned which Nutch version you're using... this would be
> > good if you could.
> > 
> > You haven't injected any seed URLs into your crawldb. From memory I think
> > the -topN parameter should be passed to the generate command.
> > 
> > Just to note, it is not necessary to set noParsing while executing the
> > fetch command. This is already default behaviour. Not sure why your machine
> > is churning but this shouldn't be happening. Do you have any log data to
> > suggest why this is the case?
> > 
> > On Thu, Sep 22, 2011 at 1:26 PM, Bai Shen <[email protected]> wrote:
> > > So I was able to get Nutch up and working using the crawl command.  I
> > > set my depth and topN and it ran and indexed the pages for me.
> > > 
> > > But now I'm trying to split out the separate pieces in order to
> > > distribute them and add my own parser.  I'm running the following.
> > > 
> > > bin/nutch generate crawl/crawldb crawl/segments
> > > export SEGMENT=crawl/segments/`ls -tr crawl/segments|tail -1`
> > > bin/nutch fetch $SEGMENT -noParsing
> > > bin/nutch parse $SEGMENT
> > > bin/nutch updatedb crawl/crawldb $SEGMENT -filter -normalize
> > > 
> > > 
> > > I don't see any way to determine how deep to crawl.  Is this possible,
> > > or do I have to manually manage the db?  And if so, how do I do that?
> > > 
> > > And as a side note, why does Nutch invoke hadoop during the fetch
> > > command even though I have noParsing set?  After fetching my links, my
> > > machine churns for around twenty minutes before finally ending, even
> > > though all the fetch threads completed already.
> > > 
> > > Thanks.
> > 
> > --
> > *Lewis*
